feat: per-turn checkpoint graders (closes #358) by spboyer · Pull Request #386 · microsoft/waza

spboyer · 2026-06-28T12:10:23Z

Summary

Implements #358 — per-turn checkpoint graders for multi-turn evals.

By default graders only run once, against the final state of a conversation. For long multi-turn runs that's too coarse — a task can drift halfway through and the final-pass grader misses it. This PR adds an additive, opt-in checkpoints: field to task YAML so authors can run any existing grader against the cumulative conversation state at specific turn boundaries.

checkpoints:
  - after_turn: 1
    graders:
      - type: text
        contains: ["analyzing", "files"]
  - after_turn: 2
    on_failure: stop
    graders:
      - type: tool_calls
        required: ["read_file"]

Scope

Per-turn graders, NOT eval-resume / pause-and-resume. The issue title mentions "checkpoint + resume" but the revised spec at revised/358.md (Wave-2 source of truth) scopes this to per-turn grading; resume is tracked separately.
Backward compatible: 1.0 task files (no checkpoints: block) load unchanged. schemaVersion bumped to 1.1 per the MINOR-bump additive policy from feat: schema versioning policy (closes #368) #382.

Design

Layer	Change
Models	New `Checkpoint{AfterTurn, Graders, OnFailure}` and `CheckpointOutcome{AfterTurn, Status, Validations, Stopped}`. `TestCase.Checkpoints` field + validation (after_turn ≥ 1, no duplicates, no exceed-turns, only when responder is nil for static fans).
Orchestration	New `internal/orchestration/checkpoints.go` `checkpointRunner` helper. Hooks fire after the initial turn (in `executeRun`), after each static follow-up turn (in `executeFollowUps`), and after each responder reply (in `executeResponderLoop`). Reuses `buildGraderContext` + `graders.RunAll` via a stub `TestCase` so checkpoint graders run independently of task-level validators.
Results	`RunResult.Checkpoints []CheckpointOutcome` propagated. `waza gate` still keys off final-pass status. Checkpoint failures flip status to `failed` (or `error` when `on_failure: stop` fired and `resp.ErrorMsg` is set).
`--skip-graders`	Honored: checkpoint evaluation short-circuits identically to final-pass grading, even for `on_failure: stop` checkpoints.

Test plan

internal/models/testcase_test.go — TestLoadTestCase_CheckpointsParsed, TestCheckpointValidation (7 subtests: invalid after_turn, missing graders, duplicate after_turn, exceeds turns, invalid on_failure, valid, responder upper-bound).
internal/orchestration/runner_orchestration_test.go — 5 new integration tests:
- TestExecuteRun_Checkpoints_PassAfterEachTurn
- TestExecuteRun_Checkpoints_FailContinue — all turns run, Status=Failed
- TestExecuteRun_Checkpoints_FailStop — short-circuits at turn 2, Stopped=true, Status=Error
- TestExecuteRun_Checkpoints_BackwardCompat — no checkpoints → Checkpoints == nil
- TestExecuteRun_Checkpoints_SkipGradersHonored — WithSkipGraders skips both grading and the stop signal
internal/models/outcome_schema_test.go — updated to expect schemaVersion: "1.1".
cmd/waza/cmd_migrate_test.go — updated messages for the new current version.
Full suite, vet, and the site build are green.

Docs

site/src/content/docs/reference/schema.mdx — new ### checkpoints section under the task spec, plus default schemaVersion bump.
site/src/content/docs/guides/eval-yaml.mdx — new "Per-Turn Checkpoints" subsection under Responder.
site/src/content/docs/reference/schema-changes.md — added 1.1 changelog entry; bumped current-version table.

Notes for reviewers

Turn numbering: initial prompt = turn 1, static follow-up index i = turn i+2, responder reply N = turn N+1.
Dashboard (web/) is not updated in this PR — surfacing checkpoint outcomes in the UI is intentionally deferred to a follow-up; the new fields are already serialized to results.json.

Closes #358

Add an additive `checkpoints:` field to task YAML so multi-turn evals can grade conversation state at specific turn boundaries instead of only the final output. - New `Checkpoint` model with after_turn, graders, on_failure (continue/stop) - New `CheckpointOutcome` recorded per task on results.json - Per-turn hook in runner (initial + follow_ups + responder loop) - on_failure: stop aborts remaining turns and flips status to error - Bumped schemaVersion to 1.1 (additive, MINOR bump per #382 policy) - Reuses existing grader plumbing (graders.RunAll + buildGraderContext) - Honors --skip-graders by short-circuiting checkpoint evaluation - Full unit + integration tests; docs (guide + schema + changelog) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds opt-in per-turn checkpoint grading for multi-turn evals by introducing a checkpoints: block in task YAML (schema bumped to 1.1) and plumbing checkpoint outcomes into results.json run results. This extends the existing grading system so intermediate conversation state can be asserted without changing existing grader implementations.

Changes:

Introduces TestCase.Checkpoints + validation and records RunResult.Checkpoints outcomes.
Adds orchestration support to run checkpoint graders after specific turns (static follow-ups and responder loop), with optional on_failure: stop.
Updates docs and tests to reflect schemaVersion 1.1 and the new YAML surface area.

Show a summary per file

File	Description
`internal/orchestration/checkpoints.go`	New checkpoint runner that executes per-turn graders and records outcomes.
`internal/orchestration/runner.go`	Wires checkpoint execution into multi-turn orchestration and propagates outcomes to results + status.
`internal/orchestration/runner_orchestration_test.go`	Integration tests covering pass/fail/stop/backward-compat/skip-graders behavior.
`internal/models/testcase.go`	Adds checkpoint model + validation rules to task schema.
`internal/models/testcase_test.go`	Parsing + validation unit tests for checkpoints.
`internal/models/outcome.go`	Adds `RunResult.Checkpoints` and `CheckpointOutcome` result shape.
`internal/models/outcome_schema_test.go`	Updates expected default schemaVersion to 1.1.
`internal/models/schema_version.go`	Bumps `CurrentSchemaVersion` to 1.1 with inline rationale.
`cmd/waza/cmd_migrate_test.go`	Updates expected migrate messaging for current schemaVersion 1.1.
`site/src/content/docs/reference/schema.mdx`	Documents `checkpoints` in task schema and updates default schemaVersion to 1.1.
`site/src/content/docs/guides/eval-yaml.mdx`	Adds “Per-Turn Checkpoints” guide section and example.
`site/src/content/docs/reference/schema-changes.md`	Updates current-version table and adds a 1.1 changelog entry.

Review details

Files reviewed: 12/12 changed files
Comments generated: 6
Review effort level: Low

- Add Type field to synthesized _checkpoint_error GraderResults - Fix docs to reference 'graders:' (the actual YAML key) instead of 'validators:' - Update schema-changes.md Policy section to reflect current 1.1 default emission while preserving 1.0 reader fallback for back-compat Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Review details

Comments suppressed due to low confidence (1)

internal/models/schema_version.go:28

defaultSchemaVersion currently defaults a missing schemaVersion to CurrentSchemaVersion (now 1.1). The PR description + schema policy docs say missing schemaVersion should default to 1.0 for backward compatibility, so this change alters the meaning of legacy artifacts (and also drives waza migrate output/test expectations).

func defaultSchemaVersion(version string) string {
	if strings.TrimSpace(version) == "" {
		return CurrentSchemaVersion
	}
	return version

Files reviewed: 12/13 changed files
Comments generated: 4
Review effort level: Low

+```yaml
+checkpoints:
+  - after_turn: 1
+    graders:
+      - type: text
+        contains: ["analyzing", "files"]
+  - after_turn: 2
+    on_failure: stop
+    graders:
+      - type: tool_calls
+        required: ["read_file"]
+```


+```yaml
+checkpoints:
+  - after_turn: 1
+    graders:
+      - type: text
+        contains: ["analyzing", "files"]
+  - after_turn: 2
+    on_failure: stop # abort the run if this checkpoint fails
+    graders:
+      - type: tool_calls
+        required: ["read_file"]
+```


+	// Status is StatusPassed when every grader in this checkpoint passed,
+	// StatusFailed when at least one grader failed.
+	Status Status `json:"status"`
+	// Validations maps grader identifier to result, identical to
+	// RunResult.Validations.
+	Validations map[string]GraderResults `json:"validations"`
+	// Stopped is true when this checkpoint had `on_failure: stop` and at


+	// Surface checkpoint failures in the run status even when graders are
+	// skipped or when the final-pass graders all passed. A failed checkpoint
+	// without on_failure: stop should still mark the run as failed; a
+	// checkpoint that recorded StatusError (grader-execution error) should
+	// promote the run to StatusError so consumers can distinguish
+	// infrastructure problems from assertion failures.


Copilot AI review requested due to automatic review settings June 28, 2026 12:10

Copilot started reviewing on behalf of spboyer June 28, 2026 12:10 View session

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Copilot AI added 2 commits June 28, 2026 08:17

chore: gitignore .impeccable/ cache directory

b1fa9bc

Copilot AI review requested due to automatic review settings June 28, 2026 12:18

Copilot started reviewing on behalf of spboyer June 28, 2026 12:18 View session

spboyer merged commit ac7c35e into main Jun 28, 2026
11 checks passed

spboyer deleted the spboyer-feat-per-turn-checkpoints branch June 28, 2026 12:22

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: per-turn checkpoint graders (closes #358)#386

feat: per-turn checkpoint graders (closes #358)#386
spboyer merged 3 commits into
mainfrom
spboyer-feat-per-turn-checkpoints

spboyer commented Jun 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

spboyer commented Jun 28, 2026

Summary

Scope

Design

Test plan

Docs

Notes for reviewers

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Review details

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Review details

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants