feat: schema versioning policy (closes #368) by spboyer · Pull Request #382 · microsoft/waza

spboyer · 2026-06-28T11:11:11Z

Summary

Adds schemaVersion support for eval.yaml and results.json, defaulting missing versions to 1.0 for backward compatibility.
Implements same-major compatibility with unknown-field warnings and cross-major errors that point to waza migrate <file>.
Adds a stub waza migrate <file> command for future major-version migrations.
Adds schema compatibility fixtures/tests under internal/validation/testdata/ and updates docs/site references with a schema changes changelog page.
Adds schemaVersion to the dashboard SSE event envelope type.
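
As a rough illustration of the reader policy summarized above (missing version defaults to 1.0, same major is accepted, a different major is rejected with a pointer to waza migrate), here is a minimal sketch. The names `checkCompatible`, `parseSchemaVersion`, `currentMajor`, and `defaultSchemaVersion` are hypothetical and are not taken from the PR; the actual logic lives in internal/models/schema_version.go and may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// currentMajor is the major version this hypothetical reader understands.
const currentMajor = 1

// defaultSchemaVersion is assumed for artifacts that omit schemaVersion.
const defaultSchemaVersion = "1.0"

// parseSchemaVersion splits a "MAJOR.MINOR" string into its integer parts.
func parseSchemaVersion(v string) (major, minor int, err error) {
	parts := strings.Split(v, ".")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("invalid schemaVersion %q: want MAJOR.MINOR", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("invalid schemaVersion %q: %w", v, err)
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("invalid schemaVersion %q: %w", v, err)
	}
	if major < 0 || minor < 0 {
		return 0, 0, fmt.Errorf("invalid schemaVersion %q: negative component", v)
	}
	return major, minor, nil
}

// checkCompatible returns the effective version for an artifact, or an error
// when the artifact's major version differs from the reader's.
func checkCompatible(path, version string) (string, error) {
	if strings.TrimSpace(version) == "" {
		version = defaultSchemaVersion // backward compatibility for old files
	}
	major, _, err := parseSchemaVersion(version)
	if err != nil {
		return "", err
	}
	if major != currentMajor {
		return "", fmt.Errorf("%s uses schemaVersion %s but this reader supports major %d; run: waza migrate %s",
			path, version, currentMajor, path)
	}
	return version, nil
}

func main() {
	for _, v := range []string{"", "1.0", "1.7", "2.0"} {
		got, err := checkCompatible("eval.yaml", v)
		fmt.Printf("input=%q effective=%q err=%v\n", v, got, err)
	}
}
```

The minor component is parsed but not compared, which is what lets a reader accept newer same-major artifacts and fall back to warnings for fields it does not know.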

Notes

snapshot.json and waza gate output are not currently emitted in this branch; the schema policy docs reserve those artifacts for the related in-flight features that introduce them.
The results golden fixture is force-added because the repo ignores generated results-*.json files by default.

Validation

/opt/homebrew/bin/go test ./...
/opt/homebrew/bin/golangci-lint run
cd site && npm run build
cd web && npm run build
cd web && npm run lint

Closes #368

Copilot

Pull request overview

Implements a schema versioning policy for waza public artifacts (notably eval.yaml and results.json), adding schemaVersion with backward-compatible defaults and reader behavior (same-major compatibility with warnings; cross-major rejection with migration guidance). It also introduces initial compatibility fixtures/tests, documentation for the policy, and a stub waza migrate <file> command.

Changes:

Add schemaVersion to public artifacts/models, defaulting missing versions to 1.0, and enforce cross-major incompatibility.
Add compatibility fixtures/tests under internal/validation/testdata/ to ensure older same-major artifacts remain readable.
Document the policy in the docs site and add a stub waza migrate <file> CLI command for future major migrations.

Show a summary per file

File	Description
web/src/hooks/useSSE.ts	Adds optional `schemaVersion` to the SSE event envelope type.
web/dist/index.html	Updates embedded dashboard entrypoint asset hash reference.
site/src/content/docs/reference/schema.mdx	Documents `schemaVersion` in the `eval.yaml` schema reference.
site/src/content/docs/reference/schema-changes.md	Adds a schema policy + changelog page for versioned artifacts.
site/src/content/docs/reference/cli.mdx	Documents the new `waza migrate` command.
site/src/content/docs/guides/eval-yaml.mdx	Adds `schemaVersion` to examples and explains the policy.
site/astro.config.mjs	Adds “Schema Changes” to the docs navigation.
schemas/eval.schema.json	Adds `schemaVersion` property to the JSON schema for `eval.yaml`.
README.md	Documents `schemaVersion` and the new `waza migrate` command.
internal/webapi/store.go	Loads outcomes via schema-aware parsing to enforce compatibility rules.
internal/webapi/additional_test.go	Minor test control-flow tweak (adds `return` after `t.Fatal`).
internal/validation/testdata/results-1.0.json	Adds a results fixture for schema `1.0`.
internal/validation/testdata/eval-1.0.yaml	Adds an eval fixture for schema `1.0`.
internal/validation/compatibility_test.go	Adds tests ensuring schema fixtures load and tolerate same-major unknown fields.
internal/storage/local.go	Switches local store loading to schema-aware outcome parsing.
internal/storage/azure_blob.go	Switches Azure blob download parsing to schema-aware outcome parsing.
internal/scaffold/scaffold.go	Adds `schemaVersion: "1.0"` to scaffolded eval YAML output.
internal/scaffold/scaffold_test.go	Verifies scaffolded eval YAML includes `schemaVersion`.
internal/models/testcase.go	Replaces strict YAML unknown-field rejection with same-major warning behavior.
internal/models/testcase_test.go	Updates test expectations: unknown fields are now tolerated for compatibility.
internal/models/spec.go	Adds `SchemaVersion` to `EvalSpec` and implements schema-version validation + unknown-field warning path.
internal/models/spec_test.go	Adds/updates tests for defaulting schemaVersion and rejecting different majors.
internal/models/schema_version.go	Introduces schema version parsing/validation and unknown-field warning helpers; adds schema-aware results parsing.
internal/models/outcome.go	Adds `SchemaVersion` to `EvaluationOutcome` and defaults it in JSON marshaling.
internal/models/outcome_schema_test.go	Adds tests for results schema defaulting/compatibility behavior.
internal/models/grader_validation_test.go	Updates expected error message text related to missing assertions.
internal/models/grader_params.go	Changes grader param decoding to warn (not error) on unknown YAML fields.
internal/mcp/coverage_test.go	Minor test control-flow tweaks (adds `return` after `t.Fatal`).
examples/repo-resources/eval.yaml	Adds `schemaVersion: "1.0"` to example eval.
examples/grader-showcase/eval.yaml	Adds `schemaVersion: "1.0"` to example eval.
examples/custom-agent/eval.yaml	Adds `schemaVersion: "1.0"` to example eval.
examples/code-explainer/eval.yaml	Adds `schemaVersion: "1.0"` to example eval.
docs/PRD.md	Adds a PRD entry capturing schema-versioned public artifacts.
cmd/waza/root.go	Registers the new `migrate` command.
cmd/waza/cmd_migrate.go	Adds stub `waza migrate <file>` command implementation.
cmd/waza/cmd_migrate_test.go	Adds tests for migrate command no-op and missing-file error.
cmd/waza/cmd_grade.go	Uses schema-aware parsing for results input when grading.
cmd/waza/cmd_compare.go	Uses schema-aware results loading for comparisons.
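
Several rows above replace strict unknown-field rejection with a warning path. One way that can be done for JSON, shown here only as a sketch, is to decode twice: once into the typed struct (which ignores unknown keys by default in `encoding/json`) and once into a raw map to list keys the struct does not know. The `outcome` type, `knownFields` set, and `decodeWithWarnings` function are hypothetical stand-ins, and the `schemaVersion` JSON key is an assumption based on the YAML tag shown in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// outcome is a hypothetical, trimmed-down stand-in for a results.json payload.
type outcome struct {
	SchemaVersion string `json:"schemaVersion"`
	EvalID        string `json:"eval_id"`
	EvalName      string `json:"eval_name"`
}

// knownFields lists the top-level keys the sketch reader understands.
var knownFields = map[string]bool{
	"schemaVersion": true,
	"eval_id":       true,
	"eval_name":     true,
}

// decodeWithWarnings decodes data and reports unknown top-level keys as
// warnings instead of failing, mirroring same-major tolerance.
func decodeWithWarnings(data []byte) (outcome, []string, error) {
	var out outcome
	if err := json.Unmarshal(data, &out); err != nil {
		return out, nil, err
	}
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(data, &raw); err != nil {
		return out, nil, err
	}
	var warnings []string
	for key := range raw {
		if !knownFields[key] {
			warnings = append(warnings, fmt.Sprintf("unknown field %q ignored", key))
		}
	}
	sort.Strings(warnings) // map iteration order is random; keep output stable
	return out, warnings, nil
}

func main() {
	data := []byte(`{"schemaVersion":"1.0","eval_id":"e1","future_field":42}`)
	out, warnings, err := decodeWithWarnings(data)
	fmt.Println(out.EvalID, warnings, err)
}
```

This only inspects top-level keys; nested objects would need the same treatment per level.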

Review details

Files reviewed: 37/38 changed files
Comments generated: 3
Review effort level: Low

Copilot

Review details

Files reviewed: 38/39 changed files
Comments generated: 4
Review effort level: Low

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Review details

Files reviewed: 39/40 changed files
Comments generated: 3
Review effort level: Low

+		ok, err := isOutcomeJSON(data)
+		if err != nil {
+			return nil
+		}


+		_, ok, err := models.ProbeEvaluationOutcomeSchemaVersion(data)
+		if err != nil {
+			return nil
+		}


+func readArtifactSchemaVersion(path string, data []byte) (artifact string, version string, err error) {
+	switch filepath.Base(path) {
+	case "eval.yaml", "eval.yml":
+		artifact = "eval.yaml"
+		var header struct {
+			SchemaVersion string `yaml:"schemaVersion"`
+		}
+		if err := yaml.Unmarshal(data, &header); err != nil {
+			return "", "", fmt.Errorf("parsing %s: %w", path, err)
+		}
+		return artifact, header.SchemaVersion, nil
+	default:
+		if filepath.Ext(path) == ".json" {
+			version, ok, err := models.ProbeEvaluationOutcomeSchemaVersion(data)
+			if err != nil {
+				return "", "", fmt.Errorf("parsing %s: %w", path, err)
+			}
+			if !ok {
+				return "", "", fmt.Errorf("unsupported JSON schema artifact %s: expected a results.json object with top-level eval_id, eval_name, summary, or tasks", path)
+			}
+			return "results.json", version, nil
+		}
+	}
+	return "", "", fmt.Errorf("unsupported schema artifact %s: expected eval.yaml, eval.yml, or a JSON results artifact", path)
+}
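
The hunk above relies on models.ProbeEvaluationOutcomeSchemaVersion, whose body is not shown on this page. Judging only from the call sites and the error text (a results.json object is recognized by a top-level eval_id, eval_name, summary, or tasks key), a probe of that shape might look like the following sketch. `probeSchemaVersion` is a hypothetical stand-in, not the PR's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// probeSchemaVersion inspects raw JSON and reports whether it looks like a
// results.json object, plus its schemaVersion if one is present.
// Hypothetical sketch; the real helper in internal/models may differ.
func probeSchemaVersion(data []byte) (version string, ok bool, err error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(data, &top); err != nil {
		// Not a JSON object (or not JSON at all).
		return "", false, err
	}
	// Treat the payload as a results artifact if any marker key exists.
	for _, key := range []string{"eval_id", "eval_name", "summary", "tasks"} {
		if _, found := top[key]; found {
			ok = true
			break
		}
	}
	if !ok {
		return "", false, nil
	}
	if raw, found := top["schemaVersion"]; found {
		if err := json.Unmarshal(raw, &version); err != nil {
			return "", false, fmt.Errorf("schemaVersion must be a string: %w", err)
		}
	}
	// An empty version is left for the caller to default.
	return version, true, nil
}

func main() {
	v, ok, err := probeSchemaVersion([]byte(`{"eval_id":"e1","schemaVersion":"1.1"}`))
	fmt.Println(v, ok, err)
}
```

Keeping "valid JSON but not a results artifact" (`ok == false`) separate from a parse error lets callers such as the storage loaders skip unrelated JSON files without treating them as failures.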


* feat: per-turn checkpoint graders (closes #358)

Add an additive `checkpoints:` field to task YAML so multi-turn evals can grade conversation state at specific turn boundaries instead of only the final output.

- New `Checkpoint` model with after_turn, graders, on_failure (continue/stop)
- New `CheckpointOutcome` recorded per task on results.json
- Per-turn hook in runner (initial + follow_ups + responder loop)
- on_failure: stop aborts remaining turns and flips status to error
- Bumped schemaVersion to 1.1 (additive, MINOR bump per #382 policy)
- Reuses existing grader plumbing (graders.RunAll + buildGraderContext)
- Honors --skip-graders by short-circuiting checkpoint evaluation
- Full unit + integration tests; docs (guide + schema + changelog)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #386

- Add Type field to synthesized _checkpoint_error GraderResults
- Fix docs to reference 'graders:' (the actual YAML key) instead of 'validators:'
- Update schema-changes.md Policy section to reflect current 1.1 default emission while preserving 1.0 reader fallback for back-compat

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: gitignore .impeccable/ cache directory

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

#388)

* feat: per-task tool metrics with structured arg matchers (closes #366)

- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id, tool_name, args, result, success, duration_ms, error) populated from session events; replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip, compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry), cli.mdx (compare TOOL USE), README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #388

- graders.mdx: matchers are single-key mappings (no kind field); graders evaluate session_digest.tool_calls (not tool_events[]); range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg); added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels; use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is interpreted as the current schema version (1.1), not 1.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test: fix MCP mocks schemaVersion test after rebase

The MCP mocks test in #387 used an empty schemaVersion and expected the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion to the current version (1.1), the test passed validation instead of failing. Make the test explicit by setting schemaVersion: '1.0' to actually trigger the gate, then bump to '1.1' in the second half.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address round-2 Copilot review on #388

- persist compiled matcher in validateToolSpecs (map value semantics)
- capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
- bucket call_count_histogram per-task across trials (not per-run)
- rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
- sync README tool_events[] field list with ToolEvent struct
- add IsCompiled() accessor + tests for persisted compile, extra args, and per-task histogram with trials_per_task > 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
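
The #388 commit message notes that the range matcher takes gte/lte/gt/lt bounds rather than a [min, max] pair. The argmatcher package itself is not shown here, so the following is only a sketch of what independent, optional bounds could look like; `rangeMatcher`, `match`, and `bound` are hypothetical names.

```go
package main

import "fmt"

// rangeMatcher is a hypothetical model of a range matcher with four
// optional bounds. A nil pointer means the bound is not set.
type rangeMatcher struct {
	GTE, LTE, GT, LT *float64
}

// match reports whether v satisfies every bound that is set.
func (m rangeMatcher) match(v float64) bool {
	if m.GTE != nil && v < *m.GTE {
		return false
	}
	if m.LTE != nil && v > *m.LTE {
		return false
	}
	if m.GT != nil && v <= *m.GT {
		return false
	}
	if m.LT != nil && v >= *m.LT {
		return false
	}
	return true
}

// bound is a small helper that returns a pointer to a float64 literal.
func bound(v float64) *float64 { return &v }

func main() {
	// Half-open interval [1, 10): inclusive lower bound, exclusive upper.
	m := rangeMatcher{GTE: bound(1), LT: bound(10)}
	fmt.Println(m.match(1), m.match(10)) // prints "true false"
}
```

Separate inclusive and exclusive bounds can express half-open intervals, which a plain [min, max] pair cannot.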

Copilot AI review requested due to automatic review settings June 28, 2026 11:11

Copilot started reviewing on behalf of spboyer June 28, 2026 11:11 View session

Copilot AI reviewed Jun 28, 2026

Comment thread internal/models/schema_version.go Outdated

Comment thread internal/storage/azure_blob.go Outdated

Comment thread cmd/waza/cmd_migrate.go Outdated

Copilot AI review requested due to automatic review settings June 28, 2026 11:31

spboyer force-pushed the spboyer-issue-368-schema-versioning branch from 08b1205 to 654c54c Compare June 28, 2026 11:31

Copilot started reviewing on behalf of spboyer June 28, 2026 11:31 View session

Copilot AI reviewed Jun 28, 2026

Comment thread cmd/waza/cmd_migrate.go

Comment thread internal/webapi/store.go Outdated

Comment thread internal/storage/local.go Outdated

spboyer force-pushed the spboyer-issue-368-schema-versioning branch from 654c54c to f6cbcb2 Compare June 28, 2026 11:35

Copilot AI review requested due to automatic review settings June 28, 2026 11:43

spboyer force-pushed the spboyer-issue-368-schema-versioning branch from f6cbcb2 to bff7007 Compare June 28, 2026 11:43

Copilot started reviewing on behalf of spboyer June 28, 2026 11:43 View session

Copilot AI reviewed Jun 28, 2026

Comment thread internal/models/schema_version.go Outdated

Comment thread internal/webapi/store.go

Comment thread internal/storage/local.go

Comment thread cmd/waza/cmd_migrate.go

Copilot AI added 4 commits June 28, 2026 07:49

feat: add schema versioning policy #368

b5e6170

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: harden schema migration guidance #368

1016765

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

test: stabilize inline grader duration check #368

e5be8b8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: detect schema artifact json shape #368

6b78b0d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

spboyer force-pushed the spboyer-issue-368-schema-versioning branch from bff7007 to 6b78b0d Compare June 28, 2026 11:51

fix: skip incompatible stored results #368

5ce0e7b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 28, 2026 11:53

Copilot started reviewing on behalf of spboyer June 28, 2026 11:54 View session

spboyer merged commit 3458ced into main Jun 28, 2026
11 checks passed

spboyer deleted the spboyer-issue-368-schema-versioning branch June 28, 2026 11:57

Copilot AI reviewed Jun 28, 2026

This was referenced Jun 28, 2026

feat: per-turn checkpoint graders (closes #358) #386

Merged

feat: per-task tool metrics with structured arg matchers (closes #366) #388

Merged

spboyer merged 5 commits into main from spboyer-issue-368-schema-versioning
