feat: schema versioning policy (closes #368)#382

Merged
spboyer merged 5 commits into main from spboyer-issue-368-schema-versioning
Jun 28, 2026
Conversation

@spboyer

@spboyer spboyer commented Jun 28, 2026

Member

Summary

  • Adds schemaVersion support for eval.yaml and results.json, defaulting missing versions to 1.0 for backward compatibility.
  • Implements same-major compatibility with unknown-field warnings and cross-major errors that point to waza migrate <file>.
  • Adds a stub waza migrate <file> command for future major-version migrations.
  • Adds schema compatibility fixtures/tests under internal/validation/testdata/ and updates docs/site references with a schema changes changelog page.
  • Adds schemaVersion to the dashboard SSE event envelope type.
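The compatibility rule described in the summary (missing version defaults to 1.0, same major is accepted, a different major is rejected with a pointer to waza migrate) could be sketched roughly as follows. This is not the code from the PR: the names checkSchemaVersion, parseMajor, and currentMajor, the MAJOR.MINOR string format, and the error wording are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// currentMajor is the major version this hypothetical reader supports.
const currentMajor = 1

// defaultSchemaVersion is applied when an artifact omits schemaVersion.
const defaultSchemaVersion = "1.0"

// parseMajor extracts the major component from a "MAJOR.MINOR" string.
// The two-component format is an assumption of this sketch.
func parseMajor(version string) (int, error) {
	majorText, minorText, found := strings.Cut(version, ".")
	if !found {
		return 0, fmt.Errorf("invalid schemaVersion %q: expected MAJOR.MINOR", version)
	}
	major, err := strconv.Atoi(majorText)
	if err != nil || major < 0 {
		return 0, fmt.Errorf("invalid schemaVersion %q: bad major component", version)
	}
	if minor, err := strconv.Atoi(minorText); err != nil || minor < 0 {
		return 0, fmt.Errorf("invalid schemaVersion %q: bad minor component", version)
	}
	return major, nil
}

// checkSchemaVersion returns the effective version for an artifact, or an
// error when the artifact was written for a different major version.
func checkSchemaVersion(path, version string) (string, error) {
	if version == "" {
		version = defaultSchemaVersion // backward-compatible default
	}
	major, err := parseMajor(version)
	if err != nil {
		return "", err
	}
	if major != currentMajor {
		return "", fmt.Errorf("%s uses schemaVersion %s but this reader supports major %d; run `waza migrate %s`",
			path, version, currentMajor, path)
	}
	return version, nil
}

func main() {
	for _, v := range []string{"", "1.1", "2.0"} {
		effective, err := checkSchemaVersion("eval.yaml", v)
		fmt.Printf("input=%q effective=%q err=%v\n", v, effective, err)
	}
}
```

Any minor version within the supported major passes the check, so a newer same-major file is only subject to the unknown-field warnings, never a hard failure.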

Notes

  • snapshot.json and waza gate output are not currently emitted in this branch; the schema policy docs reserve those artifacts for the related in-flight features that introduce them.
  • The results golden fixture is force-added because the repo ignores generated results-*.json files by default.

Validation

  • /opt/homebrew/bin/go test ./...
  • /opt/homebrew/bin/golangci-lint run
  • cd site && npm run build
  • cd web && npm run build
  • cd web && npm run lint

Closes #368

Copilot AI review requested due to automatic review settings June 28, 2026 11:11

Copilot AI left a comment

Contributor

Pull request overview

Implements a schema versioning policy for waza public artifacts (notably eval.yaml and results.json), adding schemaVersion with backward-compatible defaults and reader behavior (same-major compatibility with warnings; cross-major rejection with migration guidance). It also introduces initial compatibility fixtures/tests, documentation for the policy, and a stub waza migrate <file> command.

Changes:

  • Add schemaVersion to public artifacts/models, defaulting missing versions to 1.0, and enforce cross-major incompatibility.
  • Add compatibility fixtures/tests under internal/validation/testdata/ to ensure older same-major artifacts remain readable.
  • Document the policy in the docs site and add a stub waza migrate <file> CLI command for future major migrations.
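One way a reader might tolerate same-major unknown fields while still surfacing them is to decode the top level into a generic map and report any key it does not recognize. The sketch below is illustrative only: unknownFields and knownResultFields are hypothetical names, and the key set is a small subset chosen for the example rather than the real results.json schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// knownResultFields is a hypothetical subset of top-level results.json keys.
var knownResultFields = map[string]bool{
	"schemaVersion": true,
	"eval_id":       true,
	"eval_name":     true,
	"summary":       true,
	"tasks":         true,
}

// unknownFields returns the sorted top-level keys that this reader does not
// recognize. A caller would log them as warnings instead of failing.
func unknownFields(data []byte) ([]string, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(data, &top); err != nil {
		return nil, fmt.Errorf("parsing artifact: %w", err)
	}
	var unknown []string
	for name := range top {
		if !knownResultFields[name] {
			unknown = append(unknown, name)
		}
	}
	sort.Strings(unknown) // map iteration order is random; sort for stable output
	return unknown, nil
}

func main() {
	data := []byte(`{"schemaVersion":"1.3","eval_id":"demo","future_field":true}`)
	unknown, err := unknownFields(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range unknown {
		fmt.Printf("warning: ignoring unknown field %q\n", name)
	}
}
```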
| File | Description |
| --- | --- |
| web/src/hooks/useSSE.ts | Adds optional schemaVersion to the SSE event envelope type. |
| web/dist/index.html | Updates embedded dashboard entrypoint asset hash reference. |
| site/src/content/docs/reference/schema.mdx | Documents schemaVersion in the eval.yaml schema reference. |
| site/src/content/docs/reference/schema-changes.md | Adds a schema policy + changelog page for versioned artifacts. |
| site/src/content/docs/reference/cli.mdx | Documents the new waza migrate command. |
| site/src/content/docs/guides/eval-yaml.mdx | Adds schemaVersion to examples and explains the policy. |
| site/astro.config.mjs | Adds “Schema Changes” to the docs navigation. |
| schemas/eval.schema.json | Adds schemaVersion property to the JSON schema for eval.yaml. |
| README.md | Documents schemaVersion and the new waza migrate command. |
| internal/webapi/store.go | Loads outcomes via schema-aware parsing to enforce compatibility rules. |
| internal/webapi/additional_test.go | Minor test control-flow tweak (adds return after t.Fatal). |
| internal/validation/testdata/results-1.0.json | Adds a results fixture for schema 1.0. |
| internal/validation/testdata/eval-1.0.yaml | Adds an eval fixture for schema 1.0. |
| internal/validation/compatibility_test.go | Adds tests ensuring schema fixtures load and tolerate same-major unknown fields. |
| internal/storage/local.go | Switches local store loading to schema-aware outcome parsing. |
| internal/storage/azure_blob.go | Switches Azure blob download parsing to schema-aware outcome parsing. |
| internal/scaffold/scaffold.go | Adds schemaVersion: "1.0" to scaffolded eval YAML output. |
| internal/scaffold/scaffold_test.go | Verifies scaffolded eval YAML includes schemaVersion. |
| internal/models/testcase.go | Replaces strict YAML unknown-field rejection with same-major warning behavior. |
| internal/models/testcase_test.go | Updates test expectations: unknown fields are now tolerated for compatibility. |
| internal/models/spec.go | Adds SchemaVersion to EvalSpec and implements schema-version validation + unknown-field warning path. |
| internal/models/spec_test.go | Adds/updates tests for defaulting schemaVersion and rejecting different majors. |
| internal/models/schema_version.go | Introduces schema version parsing/validation and unknown-field warning helpers; adds schema-aware results parsing. |
| internal/models/outcome.go | Adds SchemaVersion to EvaluationOutcome and defaults it in JSON marshaling. |
| internal/models/outcome_schema_test.go | Adds tests for results schema defaulting/compatibility behavior. |
| internal/models/grader_validation_test.go | Updates expected error message text related to missing assertions. |
| internal/models/grader_params.go | Changes grader param decoding to warn (not error) on unknown YAML fields. |
| internal/mcp/coverage_test.go | Minor test control-flow tweaks (adds return after t.Fatal). |
| examples/repo-resources/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| examples/grader-showcase/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| examples/custom-agent/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| examples/code-explainer/eval.yaml | Adds schemaVersion: "1.0" to example eval. |
| docs/PRD.md | Adds a PRD entry capturing schema-versioned public artifacts. |
| cmd/waza/root.go | Registers the new migrate command. |
| cmd/waza/cmd_migrate.go | Adds stub waza migrate <file> command implementation. |
| cmd/waza/cmd_migrate_test.go | Adds tests for migrate command no-op and missing-file error. |
| cmd/waza/cmd_grade.go | Uses schema-aware parsing for results input when grading. |
| cmd/waza/cmd_compare.go | Uses schema-aware results loading for comparisons. |

Review details

  • Files reviewed: 37/38 changed files
  • Comments generated: 3
  • Review effort level: Low

Comment thread internal/models/schema_version.go Outdated
Comment thread internal/storage/azure_blob.go Outdated
Comment thread cmd/waza/cmd_migrate.go Outdated
Copilot AI review requested due to automatic review settings June 28, 2026 11:31
@spboyer spboyer force-pushed the spboyer-issue-368-schema-versioning branch from 08b1205 to 654c54c Compare June 28, 2026 11:31

Copilot AI left a comment

Contributor

Review details

  • Files reviewed: 37/38 changed files
  • Comments generated: 3
  • Review effort level: Low

Comment thread cmd/waza/cmd_migrate.go
Comment thread internal/webapi/store.go Outdated
Comment thread internal/storage/local.go Outdated
@spboyer spboyer force-pushed the spboyer-issue-368-schema-versioning branch from 654c54c to f6cbcb2 Compare June 28, 2026 11:35
Copilot AI review requested due to automatic review settings June 28, 2026 11:43
@spboyer spboyer force-pushed the spboyer-issue-368-schema-versioning branch from f6cbcb2 to bff7007 Compare June 28, 2026 11:43

Copilot AI left a comment

Contributor

Review details

  • Files reviewed: 38/39 changed files
  • Comments generated: 4
  • Review effort level: Low

Comment thread internal/models/schema_version.go Outdated
Comment thread internal/webapi/store.go
Comment thread internal/storage/local.go
Comment thread cmd/waza/cmd_migrate.go
Copilot AI added 4 commits June 28, 2026 07:49
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer force-pushed the spboyer-issue-368-schema-versioning branch from bff7007 to 6b78b0d Compare June 28, 2026 11:51
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 11:53
@spboyer spboyer merged commit 3458ced into main Jun 28, 2026
11 checks passed
@spboyer spboyer deleted the spboyer-issue-368-schema-versioning branch June 28, 2026 11:57

Copilot AI left a comment

Contributor

Review details

  • Files reviewed: 39/40 changed files
  • Comments generated: 3
  • Review effort level: Low

Comment thread internal/webapi/store.go
Comment on lines +89 to +92
ok, err := isOutcomeJSON(data)
if err != nil {
	return nil
}
Comment thread internal/storage/local.go
Comment on lines +171 to +174
_, ok, err := models.ProbeEvaluationOutcomeSchemaVersion(data)
if err != nil {
	return nil
}
Comment thread cmd/waza/cmd_migrate.go
Comment on lines +56 to +80
func readArtifactSchemaVersion(path string, data []byte) (artifact string, version string, err error) {
	switch filepath.Base(path) {
	case "eval.yaml", "eval.yml":
		artifact = "eval.yaml"
		var header struct {
			SchemaVersion string `yaml:"schemaVersion"`
		}
		if err := yaml.Unmarshal(data, &header); err != nil {
			return "", "", fmt.Errorf("parsing %s: %w", path, err)
		}
		return artifact, header.SchemaVersion, nil
	default:
		if filepath.Ext(path) == ".json" {
			version, ok, err := models.ProbeEvaluationOutcomeSchemaVersion(data)
			if err != nil {
				return "", "", fmt.Errorf("parsing %s: %w", path, err)
			}
			if !ok {
				return "", "", fmt.Errorf("unsupported JSON schema artifact %s: expected a results.json object with top-level eval_id, eval_name, summary, or tasks", path)
			}
			return "results.json", version, nil
		}
	}
	return "", "", fmt.Errorf("unsupported schema artifact %s: expected eval.yaml, eval.yml, or a JSON results artifact", path)
}
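The snippet above relies on a probe that reports three things separately: the version string, whether the JSON looks like a results artifact at all, and any parse error. The real models.ProbeEvaluationOutcomeSchemaVersion is not shown on this page, so the following is only a guess at the general shape. probeOutcomeVersion is a hypothetical stand-in, and its marker keys are taken from the error message in the snippet.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// outcomeMarkers are the top-level keys named in the error message above.
var outcomeMarkers = []string{"eval_id", "eval_name", "summary", "tasks"}

// probeOutcomeVersion is a hypothetical stand-in for a schema probe.
// ok reports whether data looks like a results artifact; err is reserved
// for input that cannot be parsed at all.
func probeOutcomeVersion(data []byte) (version string, ok bool, err error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(data, &top); err != nil {
		return "", false, err
	}
	for _, key := range outcomeMarkers {
		if _, present := top[key]; present {
			ok = true
			break
		}
	}
	if !ok {
		return "", false, nil // valid JSON, but not a results artifact
	}
	raw, present := top["schemaVersion"]
	if !present {
		return "", true, nil // caller applies the default version
	}
	if err := json.Unmarshal(raw, &version); err != nil {
		return "", true, fmt.Errorf("schemaVersion must be a string: %w", err)
	}
	return version, true, nil
}

func main() {
	for _, doc := range []string{
		`{"eval_id":"demo","schemaVersion":"1.1"}`,
		`{"tasks":[]}`,
		`{"name":"unrelated"}`,
	} {
		version, ok, err := probeOutcomeVersion([]byte(doc))
		fmt.Printf("version=%q ok=%v err=%v\n", version, ok, err)
	}
}
```

Keeping ok and err distinct lets a caller choose between skipping a file that is simply not a results artifact and reporting one that is malformed.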
spboyer added a commit that referenced this pull request Jun 28, 2026
* feat: per-turn checkpoint graders (closes #358)

Add an additive `checkpoints:` field to task YAML so multi-turn evals
can grade conversation state at specific turn boundaries instead of
only the final output.

- New `Checkpoint` model with after_turn, graders, on_failure (continue/stop)
- New `CheckpointOutcome` recorded per task on results.json
- Per-turn hook in runner (initial + follow_ups + responder loop)
- on_failure: stop aborts remaining turns and flips status to error
- Bumped schemaVersion to 1.1 (additive, MINOR bump per #382 policy)
- Reuses existing grader plumbing (graders.RunAll + buildGraderContext)
- Honors --skip-graders by short-circuiting checkpoint evaluation
- Full unit + integration tests; docs (guide + schema + changelog)
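As a rough illustration of the checkpoint behavior listed above (after_turn, graders, on_failure with continue or stop), a per-turn hook might look like the sketch below. The checkpoint struct, the evaluateCheckpoints helper, the grader function signature, and the 1-based turn numbering are assumptions for this example and are not taken from the commit.

```go
package main

import "fmt"

// checkpoint is a hypothetical in-memory form of one checkpoints: entry.
type checkpoint struct {
	AfterTurn int                            // turn boundary to grade at (assumed 1-based)
	Graders   []func(transcript []string) bool // simplified grader signature
	OnFailure string                         // "continue" or "stop"
}

// evaluateCheckpoints runs every checkpoint bound to the turn that just
// finished. passed is false if any grader failed; stop is true if a failing
// checkpoint requested on_failure: stop.
func evaluateCheckpoints(cps []checkpoint, turn int, transcript []string) (passed, stop bool) {
	passed = true
	for _, cp := range cps {
		if cp.AfterTurn != turn {
			continue
		}
		for _, grade := range cp.Graders {
			if !grade(transcript) {
				passed = false
				if cp.OnFailure == "stop" {
					stop = true
				}
			}
		}
	}
	return passed, stop
}

func main() {
	nonEmptyReply := func(t []string) bool { return len(t) > 0 && t[len(t)-1] != "" }
	cps := []checkpoint{{AfterTurn: 2, Graders: []func([]string) bool{nonEmptyReply}, OnFailure: "stop"}}

	transcript := []string{}
	for turn, reply := range []string{"hello", "", "never reached"} {
		transcript = append(transcript, reply)
		passed, stop := evaluateCheckpoints(cps, turn+1, transcript)
		fmt.Printf("turn %d: passed=%v stop=%v\n", turn+1, passed, stop)
		if stop {
			break // remaining turns are skipped
		}
	}
}
```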

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #386

- Add Type field to synthesized _checkpoint_error GraderResults
- Fix docs to reference 'graders:' (the actual YAML key) instead of 'validators:'
- Update schema-changes.md Policy section to reflect current 1.1 default emission while preserving 1.0 reader fallback for back-compat

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: gitignore .impeccable/ cache directory

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer pushed a commit that referenced this pull request Jun 28, 2026
- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id,
  tool_name, args, result, success, duration_ms, error) populated from
  session events — replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success
  rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip,
  compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry),
  cli.mdx (compare TOOL USE), README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer added a commit that referenced this pull request Jun 28, 2026
#388)

* feat: per-task tool metrics with structured arg matchers (closes #366)

- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id,
  tool_name, args, result, success, duration_ms, error) populated from
  session events — replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success
  rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip,
  compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry),
  cli.mdx (compare TOOL USE), README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address Copilot review feedback on #388

- graders.mdx: matchers are single-key mappings (no kind field);
  graders evaluate session_digest.tool_calls (not tool_events[]);
  range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg);
  added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels;
  use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is
  interpreted as the current schema version (1.1), not 1.0.
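The range matcher mentioned in the fix list uses gte/lte/gt/lt bounds rather than a [min, max] pair. A minimal sketch of that idea, assuming optional numeric bounds, is shown here. rangeMatcher and bound are hypothetical names, and the actual argmatcher package may be structured quite differently.

```go
package main

import "fmt"

// rangeMatcher is a hypothetical stand-in for a range argument matcher.
// A nil bound is not enforced.
type rangeMatcher struct {
	GTE, LTE, GT, LT *float64
}

// matches reports whether v satisfies every bound that is set.
func (m rangeMatcher) matches(v float64) bool {
	if m.GTE != nil && v < *m.GTE {
		return false
	}
	if m.LTE != nil && v > *m.LTE {
		return false
	}
	if m.GT != nil && v <= *m.GT {
		return false
	}
	if m.LT != nil && v >= *m.LT {
		return false
	}
	return true
}

// bound is a small helper for taking the address of a literal.
func bound(v float64) *float64 { return &v }

func main() {
	m := rangeMatcher{GTE: bound(1), LT: bound(10)} // 1 <= v < 10
	for _, v := range []float64{0, 1, 9.5, 10} {
		fmt.Printf("%v matches: %v\n", v, m.matches(v))
	}
}
```

Separate inclusive and exclusive bounds let a spec express half-open intervals, which a two-element [min, max] list cannot do without an extra convention.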

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test: fix MCP mocks schemaVersion test after rebase

The MCP mocks test in #387 used an empty schemaVersion and expected
the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion
to the current version (1.1), the test passed validation instead of
failing. Make the test explicit by setting schemaVersion: '1.0' to
actually trigger the gate, then bump to '1.1' in the second half.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address round-2 Copilot review on #388

- persist compiled matcher in validateToolSpecs (map value semantics)
- capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
- bucket call_count_histogram per-task across trials (not per-run)
- rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
- sync README tool_events[] field list with ToolEvent struct
- add IsCompiled() accessor + tests for persisted compile, extra args, and
  per-task histogram with trials_per_task > 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

feat: Schema versioning policy and migration tooling for public artifacts

3 participants