feat: per-turn checkpoint graders (closes #358) #386
Merged
Add an additive `checkpoints:` field to task YAML so multi-turn evals can grade conversation state at specific turn boundaries instead of only the final output.

- New `Checkpoint` model with after_turn, graders, on_failure (continue/stop)
- New `CheckpointOutcome` recorded per task on results.json
- Per-turn hook in runner (initial + follow_ups + responder loop)
- on_failure: stop aborts remaining turns and flips status to error
- Bumped schemaVersion to 1.1 (additive, MINOR bump per #382 policy)
- Reuses existing grader plumbing (graders.RunAll + buildGraderContext)
- Honors --skip-graders by short-circuiting checkpoint evaluation
- Full unit + integration tests; docs (guide + schema + changelog)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds opt-in per-turn checkpoint grading for multi-turn evals by introducing a checkpoints: block in task YAML (schema bumped to 1.1) and plumbing checkpoint outcomes into results.json run results. This extends the existing grading system so intermediate conversation state can be asserted without changing existing grader implementations.
Changes:
- Introduces `TestCase.Checkpoints` + validation and records `RunResult.Checkpoints` outcomes.
- Adds orchestration support to run checkpoint graders after specific turns (static follow-ups and responder loop), with optional `on_failure: stop`.
- Updates docs and tests to reflect schemaVersion 1.1 and the new YAML surface area.
Show a summary per file
| File | Description |
|---|---|
| `internal/orchestration/checkpoints.go` | New checkpoint runner that executes per-turn graders and records outcomes. |
| `internal/orchestration/runner.go` | Wires checkpoint execution into multi-turn orchestration and propagates outcomes to results + status. |
| `internal/orchestration/runner_orchestration_test.go` | Integration tests covering pass/fail/stop/backward-compat/skip-graders behavior. |
| `internal/models/testcase.go` | Adds checkpoint model + validation rules to task schema. |
| `internal/models/testcase_test.go` | Parsing + validation unit tests for checkpoints. |
| `internal/models/outcome.go` | Adds `RunResult.Checkpoints` and `CheckpointOutcome` result shape. |
| `internal/models/outcome_schema_test.go` | Updates expected default schemaVersion to 1.1. |
| `internal/models/schema_version.go` | Bumps `CurrentSchemaVersion` to 1.1 with inline rationale. |
| `cmd/waza/cmd_migrate_test.go` | Updates expected migrate messaging for current schemaVersion 1.1. |
| `site/src/content/docs/reference/schema.mdx` | Documents checkpoints in task schema and updates default schemaVersion to 1.1. |
| `site/src/content/docs/guides/eval-yaml.mdx` | Adds "Per-Turn Checkpoints" guide section and example. |
| `site/src/content/docs/reference/schema-changes.md` | Updates current-version table and adds a 1.1 changelog entry. |
Review details
- Files reviewed: 12/12 changed files
- Comments generated: 6
- Review effort level: Low
- Add Type field to synthesized _checkpoint_error GraderResults
- Fix docs to reference 'graders:' (the actual YAML key) instead of 'validators:'
- Update schema-changes.md Policy section to reflect current 1.1 default emission while preserving 1.0 reader fallback for back-compat

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review details
Comments suppressed due to low confidence (1)
internal/models/schema_version.go:28
`defaultSchemaVersion` currently defaults a missing schemaVersion to `CurrentSchemaVersion` (now 1.1). The PR description + schema policy docs say missing schemaVersion should default to 1.0 for backward compatibility, so this change alters the meaning of legacy artifacts (and also drives `waza migrate` output/test expectations).
func defaultSchemaVersion(version string) string {
	if strings.TrimSpace(version) == "" {
		return CurrentSchemaVersion
	}
	return version
}
- Files reviewed: 12/13 changed files
- Comments generated: 4
- Review effort level: Low
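To make the suppressed comment on `defaultSchemaVersion` concrete, here is a minimal sketch contrasting the quoted behavior with a reader-side fallback to 1.0. The names `legacySchemaVersion` and `readerSchemaVersion` are hypothetical and do not come from the PR; only `defaultSchemaVersion` and `CurrentSchemaVersion` appear in the quoted code.

```go
package main

import (
	"fmt"
	"strings"
)

// CurrentSchemaVersion matches the value this PR bumps to.
const CurrentSchemaVersion = "1.1"

// legacySchemaVersion is a hypothetical constant for artifacts written
// before schemaVersion existed.
const legacySchemaVersion = "1.0"

// defaultSchemaVersion reproduces the behavior quoted in the review:
// a blank version is treated as the current version.
func defaultSchemaVersion(version string) string {
	if strings.TrimSpace(version) == "" {
		return CurrentSchemaVersion
	}
	return version
}

// readerSchemaVersion is a hypothetical reader-side variant that treats a
// blank version as the legacy 1.0 schema instead.
func readerSchemaVersion(version string) string {
	if strings.TrimSpace(version) == "" {
		return legacySchemaVersion
	}
	return version
}

func main() {
	fmt.Println(defaultSchemaVersion("")) // blank maps to the current version
	fmt.Println(readerSchemaVersion(""))  // blank maps to the legacy version
	fmt.Println(readerSchemaVersion("1.1"))
}
```

Under this sketch, the two functions differ only in how they interpret an artifact with no schemaVersion, which is the distinction the reviewer raises.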
Comment on lines +580 to +591
```yaml
checkpoints:
  - after_turn: 1
    graders:
      - type: text
        contains: ["analyzing", "files"]
  - after_turn: 2
    on_failure: stop
    graders:
      - type: tool_calls
        required: ["read_file"]
```
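The validation rules this PR lists for a `checkpoints:` block (after_turn ≥ 1, at least one grader, no duplicate after_turn, no after_turn beyond the available turns, a known on_failure value) can be sketched roughly as below. This is an illustration, not the code in `internal/models/testcase.go`: the `Grader` type, the `ValidateCheckpoints` function, and the choice to accept an empty `on_failure` as the default are all assumptions.

```go
package main

import "fmt"

// Grader is a hypothetical placeholder for a grader config entry.
type Grader struct {
	Type string
}

// Checkpoint mirrors the fields named in the PR description.
type Checkpoint struct {
	AfterTurn int
	Graders   []Grader
	OnFailure string // assumed values: "", "continue", "stop"
}

// ValidateCheckpoints is a hypothetical helper applying the rules listed
// in the PR. totalTurns is the number of turns known up front.
func ValidateCheckpoints(cps []Checkpoint, totalTurns int) error {
	seen := make(map[int]bool)
	for i, cp := range cps {
		if cp.AfterTurn < 1 {
			return fmt.Errorf("checkpoints[%d]: after_turn must be >= 1", i)
		}
		if cp.AfterTurn > totalTurns {
			return fmt.Errorf("checkpoints[%d]: after_turn %d exceeds %d turns", i, cp.AfterTurn, totalTurns)
		}
		if seen[cp.AfterTurn] {
			return fmt.Errorf("checkpoints[%d]: duplicate after_turn %d", i, cp.AfterTurn)
		}
		seen[cp.AfterTurn] = true
		if len(cp.Graders) == 0 {
			return fmt.Errorf("checkpoints[%d]: at least one grader is required", i)
		}
		switch cp.OnFailure {
		case "", "continue", "stop":
			// accepted
		default:
			return fmt.Errorf("checkpoints[%d]: invalid on_failure %q", i, cp.OnFailure)
		}
	}
	return nil
}

func main() {
	cps := []Checkpoint{
		{AfterTurn: 1, Graders: []Grader{{Type: "text"}}},
		{AfterTurn: 2, OnFailure: "stop", Graders: []Grader{{Type: "tool_calls"}}},
	}
	fmt.Println(ValidateCheckpoints(cps, 2)) // valid for a two-turn task
	fmt.Println(ValidateCheckpoints(cps, 1)) // second checkpoint exceeds the turn count
}
```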
Comment on lines +428 to +439
```yaml
checkpoints:
  - after_turn: 1
    graders:
      - type: text
        contains: ["analyzing", "files"]
  - after_turn: 2
    on_failure: stop # abort the run if this checkpoint fails
    graders:
      - type: tool_calls
        required: ["read_file"]
```
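As a rough illustration of how `on_failure: stop` could interact with the turn loop, the sketch below evaluates a checkpoint against the cumulative transcript after each turn and aborts the remaining turns when a stopping checkpoint fails. It is not the runner in `internal/orchestration`: `RunTurns`, `Outcome`, `mentions`, and the `Check` callback (standing in for real graders) are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Checkpoint uses a Check callback as a hypothetical stand-in for graders.
type Checkpoint struct {
	AfterTurn int
	OnFailure string
	Check     func(transcript []string) bool
}

// Outcome is a simplified, hypothetical record of one checkpoint evaluation.
type Outcome struct {
	AfterTurn int
	Passed    bool
	Stopped   bool
}

// mentions reports whether any message in the transcript contains word.
func mentions(transcript []string, word string) bool {
	for _, msg := range transcript {
		if strings.Contains(msg, word) {
			return true
		}
	}
	return false
}

// RunTurns replays the replies one turn at a time (turns are 1-based) and
// evaluates any checkpoint registered for that turn against the cumulative
// transcript. It returns how many turns ran and the recorded outcomes.
func RunTurns(replies []string, cps []Checkpoint) (int, []Outcome) {
	byTurn := make(map[int]Checkpoint, len(cps))
	for _, cp := range cps {
		byTurn[cp.AfterTurn] = cp
	}
	var transcript []string
	var outcomes []Outcome
	executed := 0
	for i, reply := range replies {
		turn := i + 1
		transcript = append(transcript, reply)
		executed = turn
		cp, ok := byTurn[turn]
		if !ok {
			continue
		}
		passed := cp.Check(transcript)
		stopped := !passed && cp.OnFailure == "stop"
		outcomes = append(outcomes, Outcome{AfterTurn: turn, Passed: passed, Stopped: stopped})
		if stopped {
			break // abort the remaining turns
		}
	}
	return executed, outcomes
}

func main() {
	replies := []string{"analyzing files", "done, no tools needed", "final summary"}
	cps := []Checkpoint{
		{AfterTurn: 1, Check: func(t []string) bool { return mentions(t, "analyzing") }},
		{AfterTurn: 2, OnFailure: "stop", Check: func(t []string) bool { return mentions(t, "read_file") }},
	}
	executed, outcomes := RunTurns(replies, cps)
	fmt.Println(executed, len(outcomes), outcomes[1].Stopped) // prints "2 2 true"
}
```

In this sketch the third turn never runs, because the checkpoint after turn 2 fails and is marked `stop`.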
Comment on lines +207 to +213
    // Status is StatusPassed when every grader in this checkpoint passed,
    // StatusFailed when at least one grader failed.
    Status Status `json:"status"`
    // Validations maps grader identifier to result, identical to
    // RunResult.Validations.
    Validations map[string]GraderResults `json:"validations"`
    // Stopped is true when this checkpoint had `on_failure: stop` and at
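A minimal sketch of how such an outcome record might be built and serialized follows. Only the `status` and `validations` JSON keys come from the quoted snippet; the `after_turn` and `stopped` keys, the string values of `Status`, the shape of `GraderResults`, and the `NewOutcome` helper are assumptions, as is the reading that `Stopped` requires both `on_failure: stop` and a failed grader.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Status values are assumed strings; the real type may differ.
type Status string

const (
	StatusPassed Status = "passed"
	StatusFailed Status = "failed"
)

// GraderResults is a hypothetical, reduced result shape.
type GraderResults struct {
	Passed bool `json:"passed"`
}

// CheckpointOutcome follows the field list in the PR description.
type CheckpointOutcome struct {
	AfterTurn   int                      `json:"after_turn"` // key name assumed
	Status      Status                   `json:"status"`
	Validations map[string]GraderResults `json:"validations"`
	Stopped     bool                     `json:"stopped"` // key name assumed
}

// NewOutcome is a hypothetical helper: the status is passed only when every
// grader passed, and Stopped is set only for a failed stop-on-failure checkpoint.
func NewOutcome(afterTurn int, validations map[string]GraderResults, stopOnFailure bool) CheckpointOutcome {
	status := StatusPassed
	for _, r := range validations {
		if !r.Passed {
			status = StatusFailed
			break
		}
	}
	return CheckpointOutcome{
		AfterTurn:   afterTurn,
		Status:      status,
		Validations: validations,
		Stopped:     stopOnFailure && status == StatusFailed,
	}
}

func main() {
	o := NewOutcome(2, map[string]GraderResults{"tool_calls": {Passed: false}}, true)
	b, err := json.Marshal(o)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```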
Comment on lines +1232 to +1237
    // Surface checkpoint failures in the run status even when graders are
    // skipped or when the final-pass graders all passed. A failed checkpoint
    // without on_failure: stop should still mark the run as failed; a
    // checkpoint that recorded StatusError (grader-execution error) should
    // promote the run to StatusError so consumers can distinguish
    // infrastructure problems from assertion failures.
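The precedence described in that comment (error over failed over passed) could be expressed along these lines. This is a sketch under assumptions: `PromoteStatus` is a hypothetical function, and the string values of the `Status` constants are placeholders rather than the project's actual definitions.

```go
package main

import "fmt"

// Status values are assumed strings; the real type may differ.
type Status string

const (
	StatusPassed Status = "passed"
	StatusFailed Status = "failed"
	StatusError  Status = "error"
)

// PromoteStatus is a hypothetical helper that folds checkpoint statuses into
// the final-pass status. Any checkpoint error promotes the run to error; a
// failed checkpoint downgrades a passing run to failed; otherwise the
// final-pass status is kept.
func PromoteStatus(final Status, checkpoints []Status) Status {
	result := final
	for _, s := range checkpoints {
		switch s {
		case StatusError:
			return StatusError
		case StatusFailed:
			if result == StatusPassed {
				result = StatusFailed
			}
		}
	}
	return result
}

func main() {
	fmt.Println(PromoteStatus(StatusPassed, []Status{StatusPassed, StatusFailed})) // failed
	fmt.Println(PromoteStatus(StatusPassed, []Status{StatusFailed, StatusError}))  // error
	fmt.Println(PromoteStatus(StatusPassed, nil))                                  // passed
}
```

Keeping the promotion in one function makes it easy to see that a run with no checkpoints keeps its final-pass status, which matches the backward-compatibility goal stated in the PR.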
Summary
Implements #358 — per-turn checkpoint graders for multi-turn evals.
By default, graders run only once, against the final state of a conversation. For long multi-turn runs that's too coarse: a task can drift halfway through and the final-pass grader misses it. This PR adds an additive, opt-in `checkpoints:` field to task YAML so authors can run any existing grader against the cumulative conversation state at specific turn boundaries.
Scope
- `revised/358.md` (Wave-2 source of truth) scopes this to per-turn grading; resume is tracked separately.
- Existing tasks (no `checkpoints:` block) load unchanged.
- `schemaVersion` bumped to 1.1 per the MINOR-bump additive policy from #382 ("feat: schema versioning policy (closes #368)").
Design
- `Checkpoint{AfterTurn, Graders, OnFailure}` and `CheckpointOutcome{AfterTurn, Status, Validations, Stopped}`.
- `TestCase.Checkpoints` field + validation (after_turn ≥ 1, no duplicates, no exceed-turns, only when responder is nil for static fans).
- `internal/orchestration/checkpoints.go`: `checkpointRunner` helper. Hooks fire after the initial turn (in `executeRun`), after each static follow-up turn (in `executeFollowUps`), and after each responder reply (in `executeResponderLoop`). Reuses `buildGraderContext` + `graders.RunAll` via a stub `TestCase` so checkpoint graders run independently of task-level validators.
- `RunResult.Checkpoints []CheckpointOutcome` propagated. `waza gate` still keys off final-pass status. Checkpoint failures flip status to `failed` (or `error` when `on_failure: stop` fired and `resp.ErrorMsg` is set).
- `--skip-graders` short-circuits checkpoint evaluation, including the stop signal of `on_failure: stop` checkpoints.
Test plan
- `internal/models/testcase_test.go`: `TestLoadTestCase_CheckpointsParsed`, `TestCheckpointValidation` (7 subtests: invalid after_turn, missing graders, duplicate after_turn, exceeds turns, invalid on_failure, valid, responder upper-bound).
- `internal/orchestration/runner_orchestration_test.go`: 5 new integration tests:
  - `TestExecuteRun_Checkpoints_PassAfterEachTurn`
  - `TestExecuteRun_Checkpoints_FailContinue`: all turns run, Status=Failed
  - `TestExecuteRun_Checkpoints_FailStop`: short-circuits at turn 2, Stopped=true, Status=Error
  - `TestExecuteRun_Checkpoints_BackwardCompat`: no checkpoints → `Checkpoints == nil`
  - `TestExecuteRun_Checkpoints_SkipGradersHonored`: `WithSkipGraders` skips both grading and the stop signal
- `internal/models/outcome_schema_test.go`: updated to expect `schemaVersion: "1.1"`.
- `cmd/waza/cmd_migrate_test.go`: updated messages for the new current version.
Docs
- `site/src/content/docs/reference/schema.mdx`: new `### checkpoints` section under the task spec, plus default schemaVersion bump.
- `site/src/content/docs/guides/eval-yaml.mdx`: new "Per-Turn Checkpoints" subsection under Responder.
- `site/src/content/docs/reference/schema-changes.md`: added 1.1 changelog entry; bumped current-version table.
Notes for reviewers
- Turn numbering: follow-up `i` = turn `i+2`, responder reply N = turn `N+1`.
- The web UI (`web/`) is not updated in this PR; surfacing checkpoint outcomes in the UI is intentionally deferred to a follow-up. The new fields are already serialized to `results.json`.

Closes #358