feat: schema versioning policy (closes #368) #382
Merged
Pull request overview
Implements a schema versioning policy for waza public artifacts (notably eval.yaml and results.json), adding schemaVersion with backward-compatible defaults and reader behavior (same-major compatibility with warnings; cross-major rejection with migration guidance). It also introduces initial compatibility fixtures/tests, documentation for the policy, and a stub waza migrate <file> command.
Changes:
- Add `schemaVersion` to public artifacts/models, defaulting missing versions to `1.0`, and enforce cross-major incompatibility.
- Add compatibility fixtures/tests under `internal/validation/testdata/` to ensure older same-major artifacts remain readable.
- Document the policy in the docs site and add a stub `waza migrate <file>` CLI command for future major migrations.
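The reader behavior described in the overview (default a missing version, accept the same major, reject a different major with migration guidance) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `readerVersion`, `parseVersion`, and `checkCompatible` are hypothetical names, and warning only when the artifact's minor version is newer than the reader's is an assumption about when the warning fires.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// readerVersion is the schema version this hypothetical reader understands.
const readerVersion = "1.0"

// parseVersion splits a "MAJOR.MINOR" string into its numeric parts.
func parseVersion(v string) (major, minor int, err error) {
	majorStr, minorStr, found := strings.Cut(v, ".")
	if !found {
		return 0, 0, fmt.Errorf("invalid schemaVersion %q: want MAJOR.MINOR", v)
	}
	major, err = strconv.Atoi(majorStr)
	if err != nil || major < 0 {
		return 0, 0, fmt.Errorf("invalid major in schemaVersion %q", v)
	}
	minor, err = strconv.Atoi(minorStr)
	if err != nil || minor < 0 {
		return 0, 0, fmt.Errorf("invalid minor in schemaVersion %q", v)
	}
	return major, minor, nil
}

// checkCompatible returns an error for a different major version and a
// non-empty warning for a newer minor version within the same major.
func checkCompatible(artifactVersion string) (warning string, err error) {
	if artifactVersion == "" {
		artifactVersion = "1.0" // missing versions default to 1.0 in this PR
	}
	aMajor, aMinor, err := parseVersion(artifactVersion)
	if err != nil {
		return "", err
	}
	rMajor, rMinor, err := parseVersion(readerVersion)
	if err != nil {
		return "", err
	}
	if aMajor != rMajor {
		return "", fmt.Errorf("schemaVersion %s is not compatible with reader %s; run waza migrate <file>",
			artifactVersion, readerVersion)
	}
	if aMinor > rMinor {
		// Assumption: newer same-major artifacts load, with a warning.
		return fmt.Sprintf("schemaVersion %s is newer than reader %s; unknown fields are ignored",
			artifactVersion, readerVersion), nil
	}
	return "", nil
}

func main() {
	for _, v := range []string{"", "1.0", "1.4", "2.0"} {
		warning, err := checkCompatible(v)
		fmt.Printf("version=%q warning=%q err=%v\n", v, warning, err)
	}
}
```

Returning the warning as a value rather than logging it keeps the check easy to test and leaves the choice of output channel to the caller.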
Show a summary per file
| File | Description |
|---|---|
| web/src/hooks/useSSE.ts | Adds optional schemaVersion to the SSE event envelope type. |
| web/dist/index.html | Updates embedded dashboard entrypoint asset hash reference. |
| site/src/content/docs/reference/schema.mdx | Documents schemaVersion in the eval.yaml schema reference. |
| site/src/content/docs/reference/schema-changes.md | Adds a schema policy + changelog page for versioned artifacts. |
| site/src/content/docs/reference/cli.mdx | Documents the new waza migrate command. |
| site/src/content/docs/guides/eval-yaml.mdx | Adds schemaVersion to examples and explains the policy. |
| site/astro.config.mjs | Adds “Schema Changes” to the docs navigation. |
| schemas/eval.schema.json | Adds schemaVersion property to the JSON schema for eval.yaml. |
| README.md | Documents schemaVersion and the new waza migrate command. |
| internal/webapi/store.go | Loads outcomes via schema-aware parsing to enforce compatibility rules. |
| internal/webapi/additional_test.go | Minor test control-flow tweak (adds return after t.Fatal). |
| internal/validation/testdata/results-1.0.json | Adds a results fixture for schema 1.0. |
| internal/validation/testdata/eval-1.0.yaml | Adds an eval fixture for schema 1.0. |
| internal/validation/compatibility_test.go | Adds tests ensuring schema fixtures load and tolerate same-major unknown fields. |
| internal/storage/local.go | Switches local store loading to schema-aware outcome parsing. |
| internal/storage/azure_blob.go | Switches Azure blob download parsing to schema-aware outcome parsing. |
| internal/scaffold/scaffold.go | Adds schemaVersion: "1.0" to scaffolded eval YAML output. |
| internal/scaffold/scaffold_test.go | Verifies scaffolded eval YAML includes schemaVersion. |
| internal/models/testcase.go | Replaces strict YAML unknown-field rejection with same-major warning behavior. |
| internal/models/testcase_test.go | Updates test expectations: unknown fields are now tolerated for compatibility. |
| internal/models/spec.go | Adds SchemaVersion to EvalSpec and implements schema-version validation + unknown-field warning path. |
| internal/models/spec_test.go | Adds/updates tests for defaulting schemaVersion and rejecting different majors. |
| internal/models/schema_version.go | Introduces schema version parsing/validation and unknown-field warning helpers; adds schema-aware results parsing. |
| internal/models/outcome.go | Adds SchemaVersion to EvaluationOutcome and defaults it in JSON marshaling. |
| internal/models/outcome_schema_test.go | Adds tests for results schema defaulting/compatibility behavior. |
| internal/models/grader_validation_test.go | Updates expected error message text related to missing assertions. |
| internal/models/grader_params.go | Changes grader param decoding to warn (not error) on unknown YAML fields. |
| internal/mcp/coverage_test.go | Minor test control-flow tweaks (adds return after t.Fatal). |
| examples/repo-resources/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| examples/grader-showcase/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| examples/custom-agent/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| examples/code-explainer/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| docs/PRD.md | Adds a PRD entry capturing schema-versioned public artifacts. |
| cmd/waza/root.go | Registers the new migrate command. |
| cmd/waza/cmd_migrate.go | Adds stub waza migrate <file> command implementation. |
| cmd/waza/cmd_migrate_test.go | Adds tests for migrate command no-op and missing-file error. |
| cmd/waza/cmd_grade.go | Uses schema-aware parsing for results input when grading. |
| cmd/waza/cmd_compare.go | Uses schema-aware results loading for comparisons. |
Review details
- Files reviewed: 37/38 changed files
- Comments generated: 3
- Review effort level: Low
08b1205 to 654c54c
654c54c to f6cbcb2
f6cbcb2 to bff7007
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bff7007 to 6b78b0d
Comment on lines +89 to +92

    ok, err := isOutcomeJSON(data)
    if err != nil {
        return nil
    }
Comment on lines +171 to +174

    _, ok, err := models.ProbeEvaluationOutcomeSchemaVersion(data)
    if err != nil {
        return nil
    }
Comment on lines +56 to +80

    func readArtifactSchemaVersion(path string, data []byte) (artifact string, version string, err error) {
        switch filepath.Base(path) {
        case "eval.yaml", "eval.yml":
            artifact = "eval.yaml"
            var header struct {
                SchemaVersion string `yaml:"schemaVersion"`
            }
            if err := yaml.Unmarshal(data, &header); err != nil {
                return "", "", fmt.Errorf("parsing %s: %w", path, err)
            }
            return artifact, header.SchemaVersion, nil
        default:
            if filepath.Ext(path) == ".json" {
                version, ok, err := models.ProbeEvaluationOutcomeSchemaVersion(data)
                if err != nil {
                    return "", "", fmt.Errorf("parsing %s: %w", path, err)
                }
                if !ok {
                    return "", "", fmt.Errorf("unsupported JSON schema artifact %s: expected a results.json object with top-level eval_id, eval_name, summary, or tasks", path)
                }
                return "results.json", version, nil
            }
        }
        return "", "", fmt.Errorf("unsupported schema artifact %s: expected eval.yaml, eval.yml, or a JSON results artifact", path)
    }
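The excerpt above calls `models.ProbeEvaluationOutcomeSchemaVersion`, whose body is not shown on this page. Judging only from the error message (a results.json object is recognized by a top-level `eval_id`, `eval_name`, `summary`, or `tasks`), a probe of that shape might look something like the sketch below. `probeResultsSchemaVersion` and `resultsMarkers` are hypothetical stand-ins, and the real function may differ in both signature details and behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resultsMarkers are the top-level keys named in the error message above.
var resultsMarkers = []string{"eval_id", "eval_name", "summary", "tasks"}

// probeResultsSchemaVersion reports whether data looks like a results.json
// object and, if so, returns its schemaVersion ("" when the field is absent).
func probeResultsSchemaVersion(data []byte) (version string, ok bool, err error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(data, &top); err != nil {
		return "", false, err
	}
	for _, key := range resultsMarkers {
		if _, found := top[key]; found {
			ok = true
			break
		}
	}
	if !ok {
		return "", false, nil // valid JSON, but not a results artifact
	}
	if raw, found := top["schemaVersion"]; found {
		if err := json.Unmarshal(raw, &version); err != nil {
			return "", false, fmt.Errorf("schemaVersion must be a string: %w", err)
		}
	}
	return version, true, nil
}

func main() {
	version, ok, err := probeResultsSchemaVersion([]byte(`{"schemaVersion":"1.1","tasks":[]}`))
	fmt.Println(version, ok, err)
}
```

Decoding into `map[string]json.RawMessage` avoids parsing the full document just to read one field, and it keeps "not a results file" (`ok == false`) distinct from "malformed JSON" (`err != nil`), which is the distinction the caller in the excerpt relies on.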
This was referenced Jun 28, 2026
spboyer added a commit that referenced this pull request Jun 28, 2026

* feat: per-turn checkpoint graders (closes #358)

  Add an additive `checkpoints:` field to task YAML so multi-turn evals can grade conversation state at specific turn boundaries instead of only the final output.

  - New `Checkpoint` model with after_turn, graders, on_failure (continue/stop)
  - New `CheckpointOutcome` recorded per task on results.json
  - Per-turn hook in runner (initial + follow_ups + responder loop)
  - on_failure: stop aborts remaining turns and flips status to error
  - Bumped schemaVersion to 1.1 (additive, MINOR bump per #382 policy)
  - Reuses existing grader plumbing (graders.RunAll + buildGraderContext)
  - Honors --skip-graders by short-circuiting checkpoint evaluation
  - Full unit + integration tests; docs (guide + schema + changelog)

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #386

  - Add Type field to synthesized _checkpoint_error GraderResults
  - Fix docs to reference 'graders:' (the actual YAML key) instead of 'validators:'
  - Update schema-changes.md Policy section to reflect current 1.1 default emission while preserving 1.0 reader fallback for back-compat

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: gitignore .impeccable/ cache directory

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
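The checkpoint behavior in the commit message above (grade after a given turn; `on_failure: stop` aborts the remaining turns and flips status to error) can be illustrated with a small loop. Everything here is a hypothetical sketch: the `checkpoint` and `checkpointOutcome` types, the 1-based `AfterTurn` numbering, the grader as a plain function, and leaving the status untouched on `continue` are assumptions, not the runner's actual code.

```go
package main

import "fmt"

// checkpoint is a hypothetical stand-in for the Checkpoint model.
type checkpoint struct {
	AfterTurn int                            // assumed 1-based turn index
	OnFailure string                         // "continue" or "stop"
	Grade     func(transcript []string) bool // stand-in for the grader list
}

// checkpointOutcome records one checkpoint evaluation.
type checkpointOutcome struct {
	AfterTurn int
	Passed    bool
}

// runTurns replays turns in order and evaluates matching checkpoints after
// each one. A failed checkpoint with OnFailure "stop" aborts the remaining
// turns and reports status "error".
func runTurns(turns []string, checkpoints []checkpoint) (status string, outcomes []checkpointOutcome, turnsRun int) {
	status = "ok"
	var transcript []string
	for i, turn := range turns {
		transcript = append(transcript, turn)
		turnsRun = i + 1
		for _, cp := range checkpoints {
			if cp.AfterTurn != turnsRun {
				continue
			}
			passed := cp.Grade(transcript)
			outcomes = append(outcomes, checkpointOutcome{AfterTurn: cp.AfterTurn, Passed: passed})
			if !passed && cp.OnFailure == "stop" {
				return "error", outcomes, turnsRun
			}
		}
	}
	return status, outcomes, turnsRun
}

func main() {
	cps := []checkpoint{{
		AfterTurn: 2,
		OnFailure: "stop",
		Grade:     func(transcript []string) bool { return len(transcript) > 5 },
	}}
	status, outcomes, ran := runTurns([]string{"hello", "follow-up", "wrap-up"}, cps)
	fmt.Println(status, outcomes, ran)
}
```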
spboyer pushed a commit that referenced this pull request Jun 28, 2026
- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id,
tool_name, args, result, success, duration_ms, error) populated from
session events — replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success
rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip,
compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry),
cli.mdx (compare TOOL USE), README.md
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
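To make the matcher kinds listed above concrete, here is a rough sketch of evaluating one argument against a single-key matcher mapping. It is not the `argmatcher` package: `matcher`, `matchArg`, and `inRange` are hypothetical names, `json_schema` is left out, and the `gte`/`lte`/`gt`/`lt` bound names are taken from a later review fix in this same thread. Representing bounds as `map[string]float64` is a simplification for the example.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// matcher is a hypothetical single-key mapping, e.g. {"regex": "^src/"}.
type matcher map[string]any

// matchArg evaluates value against the one matcher kind present in m.
func matchArg(m matcher, value any) (bool, error) {
	if len(m) != 1 {
		return false, fmt.Errorf("matcher must have exactly one key, got %d", len(m))
	}
	for kind, want := range m {
		switch kind {
		case "equals":
			return reflect.DeepEqual(want, value), nil
		case "contains", "regex":
			pattern, okPattern := want.(string)
			text, okText := value.(string)
			if !okPattern || !okText {
				return false, fmt.Errorf("%s matcher needs string operands", kind)
			}
			if kind == "contains" {
				return strings.Contains(text, pattern), nil
			}
			re, err := regexp.Compile(pattern)
			if err != nil {
				return false, fmt.Errorf("bad regex %q: %w", pattern, err)
			}
			return re.MatchString(text), nil
		case "range":
			bounds, okBounds := want.(map[string]float64)
			n, okNum := value.(float64)
			if !okBounds || !okNum {
				return false, fmt.Errorf("range matcher needs float64 bounds and value")
			}
			return inRange(n, bounds)
		default:
			return false, fmt.Errorf("unknown matcher %q", kind)
		}
	}
	return false, nil // unreachable: len(m) == 1 guarantees one iteration
}

// inRange checks n against every bound; all bounds must hold.
func inRange(n float64, bounds map[string]float64) (bool, error) {
	for op := range bounds { // validate first so errors do not depend on map order
		switch op {
		case "gte", "lte", "gt", "lt":
		default:
			return false, fmt.Errorf("unknown range bound %q", op)
		}
	}
	for op, bound := range bounds {
		holds := true
		switch op {
		case "gte":
			holds = n >= bound
		case "lte":
			holds = n <= bound
		case "gt":
			holds = n > bound
		case "lt":
			holds = n < bound
		}
		if !holds {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := matchArg(matcher{"regex": `^src/.*\.go$`}, "src/main.go")
	fmt.Println(ok, err)
}
```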
spboyer added a commit that referenced this pull request Jun 28, 2026

#388)

* feat: per-task tool metrics with structured arg matchers (closes #366)

  - Add argmatcher package (equals/regex/contains/range/json_schema)
  - Extend tool_calls grader with expect: [{tool, args}] block
  - Extend tool_constraint grader with args: matchers on expect_tools
  - Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id, tool_name, args, result, success, duration_ms, error) populated from session events; replay-friendly for Wave 3 (#367), OTel-aligned
  - Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
  - waza compare prints aggregate TOOL USE section (total calls, success rate, avg/task, histogram, selection accuracy) when tool data present
  - Unit tests for matchers, builder, both graders, schema round-trip, compare metrics + histogram
  - Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry), cli.mdx (compare TOOL USE), README.md

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #388

  - graders.mdx: matchers are single-key mappings (no kind field); graders evaluate session_digest.tool_calls (not tool_events[]); range matcher uses gte/lte/gt/lt (not [min, max]).
  - tool_events.go: stringifyResult comment matches JSON-only behavior.
  - cmd_compare.go: histogram bucketed per-run (not truncated per-task avg); added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels; use tagged switch on runCalls.
  - schema-changes.md / README.md / schema.mdx: missing schemaVersion is interpreted as the current schema version (1.1), not 1.0.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test: fix MCP mocks schemaVersion test after rebase

  The MCP mocks test in #387 used an empty schemaVersion and expected the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion to the current version (1.1), the test passed validation instead of failing. Make the test explicit by setting schemaVersion: '1.0' to actually trigger the gate, then bump to '1.1' in the second half.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address round-2 Copilot review on #388

  - persist compiled matcher in validateToolSpecs (map value semantics)
  - capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
  - bucket call_count_histogram per-task across trials (not per-run)
  - rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
  - sync README tool_events[] field list with ToolEvent struct
  - add IsCompiled() accessor + tests for persisted compile, extra args, and per-task histogram with trials_per_task > 1

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
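The "persist compiled matcher ... (map value semantics)" fix refers to a common Go pitfall: ranging over a map of structs yields copies, so state set on the loop variable is lost unless it is written back. The sketch below shows the pitfall and one way around it. `argMatcher`, `compileLost`, and `compilePersisted` are hypothetical; only the `IsCompiled()` accessor name comes from the commit message, and its real signature is assumed.

```go
package main

import (
	"fmt"
	"regexp"
)

// argMatcher is a hypothetical matcher that caches its compiled regex.
type argMatcher struct {
	Pattern  string
	compiled *regexp.Regexp
}

// IsCompiled reports whether the regex has been compiled and cached.
func (m argMatcher) IsCompiled() bool { return m.compiled != nil }

// compileLost compiles each pattern but stores the result on the loop
// variable, which is a copy of the map value, so nothing is persisted.
func compileLost(args map[string]argMatcher) error {
	for _, m := range args {
		re, err := regexp.Compile(m.Pattern)
		if err != nil {
			return err
		}
		m.compiled = re // updates the copy only
	}
	return nil
}

// compilePersisted writes the updated copy back into the map.
func compilePersisted(args map[string]argMatcher) error {
	for name, m := range args {
		re, err := regexp.Compile(m.Pattern)
		if err != nil {
			return err
		}
		m.compiled = re
		args[name] = m // persist the compiled matcher
	}
	return nil
}

func main() {
	args := map[string]argMatcher{"path": {Pattern: `^src/`}}
	_ = compileLost(args)
	fmt.Println("after compileLost:", args["path"].IsCompiled())
	_ = compilePersisted(args)
	fmt.Println("after compilePersisted:", args["path"].IsCompiled())
}
```

Storing `*argMatcher` pointers in the map would also avoid the copy; writing back keeps value semantics and is the smaller change.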
Summary
- Adds `schemaVersion` support for `eval.yaml` and `results.json`, defaulting missing versions to `1.0` for backward compatibility.
- … `waza migrate <file>`.
- Adds a `waza migrate <file>` command for future major-version migrations.
- Adds compatibility fixtures under `internal/validation/testdata/` and updates docs/site references with a schema changes changelog page.
- Adds `schemaVersion` to the dashboard SSE event envelope type.

Notes
- `snapshot.json` and `waza gate` output are not currently emitted in this branch; the schema policy docs reserve those artifacts for the related in-flight features that introduce them.
- … `results-*.json` files by default.

Validation
- `/opt/homebrew/bin/go test ./...`
- `/opt/homebrew/bin/golangci-lint run`
- `cd site && npm run build`
- `cd web && npm run build`
- `cd web && npm run lint`

Closes #368