
feat: per-turn checkpoint graders (closes #358) #386

Merged: spboyer merged 3 commits into main from spboyer-feat-per-turn-checkpoints on Jun 28, 2026

@spboyer spboyer commented Jun 28, 2026


Summary

Implements #358: per-turn checkpoint graders for multi-turn evals.

By default, graders run only once, against the final state of a conversation. For long multi-turn runs that is too coarse: a task can drift halfway through and the final-pass grader can miss it. This PR adds an additive, opt-in checkpoints: field to task YAML so authors can run any existing grader against the cumulative conversation state at specific turn boundaries.

checkpoints:
  - after_turn: 1
    graders:
      - type: text
        contains: ["analyzing", "files"]
  - after_turn: 2
    on_failure: stop
    graders:
      - type: tool_calls
        required: ["read_file"]
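
To make the intended behavior of the example above concrete, here is a minimal Go sketch of a per-turn loop that evaluates checkpoints against the cumulative transcript and honors on_failure: stop. It is not the PR's implementation: the `Grader` function type, `containsAll`, `Outcome`, and `runTurns` are hypothetical names, the text-matching grader stands in for both the text and tool_calls graders, and treating an empty on_failure as "continue" is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Grader inspects the cumulative transcript and reports pass/fail.
// Hypothetical stand-in for the project's real grader plumbing.
type Grader func(transcript []string) bool

// Checkpoint mirrors the YAML fields after_turn, graders, on_failure.
type Checkpoint struct {
	AfterTurn int
	Graders   []Grader
	OnFailure string // "stop", or anything else for continue (assumed default)
}

// Outcome records the result of one checkpoint (hypothetical shape).
type Outcome struct {
	AfterTurn int
	Passed    bool
	Stopped   bool
}

// containsAll is a toy text grader: every needle must appear somewhere
// in the transcript accumulated so far.
func containsAll(needles ...string) Grader {
	return func(transcript []string) bool {
		joined := strings.Join(transcript, "\n")
		for _, n := range needles {
			if !strings.Contains(joined, n) {
				return false
			}
		}
		return true
	}
}

// runTurns replays replies one turn at a time (turns are 1-based) and
// evaluates every checkpoint whose AfterTurn equals the turn just finished.
// It returns the recorded outcomes and the number of turns actually run.
func runTurns(replies []string, cps []Checkpoint) ([]Outcome, int) {
	var transcript []string
	var outcomes []Outcome
	for i, reply := range replies {
		turn := i + 1
		transcript = append(transcript, reply)
		for _, cp := range cps {
			if cp.AfterTurn != turn {
				continue
			}
			passed := true
			for _, g := range cp.Graders {
				if !g(transcript) {
					passed = false
				}
			}
			stopped := !passed && cp.OnFailure == "stop"
			outcomes = append(outcomes, Outcome{AfterTurn: turn, Passed: passed, Stopped: stopped})
			if stopped {
				return outcomes, turn // abort the remaining turns
			}
		}
	}
	return outcomes, len(replies)
}

func main() {
	cps := []Checkpoint{
		{AfterTurn: 1, Graders: []Grader{containsAll("analyzing", "files")}},
		{AfterTurn: 2, OnFailure: "stop", Graders: []Grader{containsAll("read_file")}},
	}
	replies := []string{"analyzing the files", "done", "never reached"}
	outcomes, ran := runTurns(replies, cps)
	fmt.Println(outcomes, ran)
}
```

In this sketch the second checkpoint fails and stops the run, so the third reply is never appended to the transcript.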

Scope

  • Per-turn graders, NOT eval-resume / pause-and-resume. The issue title mentions "checkpoint + resume" but the revised spec at revised/358.md (Wave-2 source of truth) scopes this to per-turn grading; resume is tracked separately.
  • Backward compatible: 1.0 task files (no checkpoints: block) load unchanged. schemaVersion is bumped to 1.1 per the MINOR-bump additive policy from #382 (feat: schema versioning policy, closes #368).

Design

| Layer | Change |
| --- | --- |
| Models | New Checkpoint{AfterTurn, Graders, OnFailure} and CheckpointOutcome{AfterTurn, Status, Validations, Stopped}. TestCase.Checkpoints field + validation (after_turn ≥ 1, no duplicates, no exceed-turns, only when responder is nil for static fans). |
| Orchestration | New internal/orchestration/checkpoints.go checkpointRunner helper. Hooks fire after the initial turn (in executeRun), after each static follow-up turn (in executeFollowUps), and after each responder reply (in executeResponderLoop). Reuses buildGraderContext + graders.RunAll via a stub TestCase so checkpoint graders run independently of task-level validators. |
| Results | RunResult.Checkpoints []CheckpointOutcome propagated. waza gate still keys off final-pass status. Checkpoint failures flip status to failed (or error when on_failure: stop fired and resp.ErrorMsg is set). |
| --skip-graders | Honored: checkpoint evaluation short-circuits identically to final-pass grading, even for on_failure: stop checkpoints. |
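
The validation rules listed for the Models layer can be sketched as a single pass over the checkpoints. This is an illustrative approximation only: `validateCheckpoints`, the `GraderCount` stand-in field, the `totalTurns` parameter, and the error messages are hypothetical, and the responder-specific upper bound mentioned in the test plan is not modeled.

```go
package main

import "fmt"

// Checkpoint is a simplified stand-in for the model described above.
type Checkpoint struct {
	AfterTurn   int
	GraderCount int    // stand-in for len(Graders); hypothetical field
	OnFailure   string // "", "continue", or "stop"
}

// validateCheckpoints applies the rules named in the PR description:
// after_turn >= 1, at least one grader, no duplicate after_turn, no
// after_turn beyond the known number of turns, and a valid on_failure.
// totalTurns is assumed to be known up front (the static follow-up case).
func validateCheckpoints(cps []Checkpoint, totalTurns int) error {
	seen := make(map[int]bool)
	for _, cp := range cps {
		switch {
		case cp.AfterTurn < 1:
			return fmt.Errorf("after_turn must be >= 1, got %d", cp.AfterTurn)
		case cp.GraderCount == 0:
			return fmt.Errorf("checkpoint after_turn=%d has no graders", cp.AfterTurn)
		case seen[cp.AfterTurn]:
			return fmt.Errorf("duplicate after_turn %d", cp.AfterTurn)
		case cp.AfterTurn > totalTurns:
			return fmt.Errorf("after_turn %d exceeds %d turns", cp.AfterTurn, totalTurns)
		case cp.OnFailure != "" && cp.OnFailure != "continue" && cp.OnFailure != "stop":
			return fmt.Errorf("invalid on_failure %q", cp.OnFailure)
		}
		seen[cp.AfterTurn] = true
	}
	return nil
}

func main() {
	valid := []Checkpoint{
		{AfterTurn: 1, GraderCount: 1},
		{AfterTurn: 2, GraderCount: 1, OnFailure: "stop"},
	}
	fmt.Println(validateCheckpoints(valid, 2))

	dup := []Checkpoint{{AfterTurn: 1, GraderCount: 1}, {AfterTurn: 1, GraderCount: 1}}
	fmt.Println(validateCheckpoints(dup, 2))
}
```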

Test plan

  • internal/models/testcase_test.go: TestLoadTestCase_CheckpointsParsed, TestCheckpointValidation (7 subtests: invalid after_turn, missing graders, duplicate after_turn, exceeds turns, invalid on_failure, valid, responder upper-bound).
  • internal/orchestration/runner_orchestration_test.go — 5 new integration tests:
    • TestExecuteRun_Checkpoints_PassAfterEachTurn
    • TestExecuteRun_Checkpoints_FailContinue — all turns run, Status=Failed
    • TestExecuteRun_Checkpoints_FailStop — short-circuits at turn 2, Stopped=true, Status=Error
    • TestExecuteRun_Checkpoints_BackwardCompat — no checkpoints → Checkpoints == nil
    • TestExecuteRun_Checkpoints_SkipGradersHonored: WithSkipGraders skips both grading and the stop signal
  • internal/models/outcome_schema_test.go — updated to expect schemaVersion: "1.1".
  • cmd/waza/cmd_migrate_test.go — updated messages for the new current version.
  • Full suite, vet, and the site build are green.

Docs

  • site/src/content/docs/reference/schema.mdx — new ### checkpoints section under the task spec, plus default schemaVersion bump.
  • site/src/content/docs/guides/eval-yaml.mdx — new "Per-Turn Checkpoints" subsection under Responder.
  • site/src/content/docs/reference/schema-changes.md — added 1.1 changelog entry; bumped current-version table.

Notes for reviewers

  • Turn numbering: initial prompt = turn 1, static follow-up index i = turn i+2, responder reply N = turn N+1.
  • Dashboard (web/) is not updated in this PR — surfacing checkpoint outcomes in the UI is intentionally deferred to a follow-up; the new fields are already serialized to results.json.
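
The turn-numbering convention in the first note can be written as two small helpers. This is a sketch: the function names are hypothetical, and it assumes the follow-up index i is zero-based and the responder reply number N is one-based, which is what the stated offsets suggest.

```go
package main

import "fmt"

// initialTurn is the turn number of the initial prompt.
const initialTurn = 1

// turnForFollowUp maps a static follow-up index (assumed zero-based)
// to its turn number: index i is turn i+2.
func turnForFollowUp(i int) int { return i + 2 }

// turnForResponderReply maps a responder reply number (assumed one-based)
// to its turn number: reply N is turn N+1.
func turnForResponderReply(n int) int { return n + 1 }

func main() {
	fmt.Println(initialTurn, turnForFollowUp(0), turnForFollowUp(1))
	fmt.Println(turnForResponderReply(1), turnForResponderReply(2))
}
```

Under these assumptions, a checkpoint with after_turn: 2 fires after the first static follow-up or after the first responder reply.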

Closes #358

Add an additive `checkpoints:` field to task YAML so multi-turn evals
can grade conversation state at specific turn boundaries instead of
only the final output.

- New `Checkpoint` model with after_turn, graders, on_failure (continue/stop)
- New `CheckpointOutcome` recorded per task on results.json
- Per-turn hook in runner (initial + follow_ups + responder loop)
- on_failure: stop aborts remaining turns and flips status to error
- Bumped schemaVersion to 1.1 (additive, MINOR bump per #382 policy)
- Reuses existing grader plumbing (graders.RunAll + buildGraderContext)
- Honors --skip-graders by short-circuiting checkpoint evaluation
- Full unit + integration tests; docs (guide + schema + changelog)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 12:10

Copilot AI left a comment


Pull request overview

Adds opt-in per-turn checkpoint grading for multi-turn evals by introducing a checkpoints: block in task YAML (schema bumped to 1.1) and plumbing checkpoint outcomes into results.json run results. This extends the existing grading system so intermediate conversation state can be asserted without changing existing grader implementations.

Changes:

  • Introduces TestCase.Checkpoints + validation and records RunResult.Checkpoints outcomes.
  • Adds orchestration support to run checkpoint graders after specific turns (static follow-ups and responder loop), with optional on_failure: stop.
  • Updates docs and tests to reflect schemaVersion 1.1 and the new YAML surface area.
| File | Description |
| --- | --- |
| internal/orchestration/checkpoints.go | New checkpoint runner that executes per-turn graders and records outcomes. |
| internal/orchestration/runner.go | Wires checkpoint execution into multi-turn orchestration and propagates outcomes to results + status. |
| internal/orchestration/runner_orchestration_test.go | Integration tests covering pass/fail/stop/backward-compat/skip-graders behavior. |
| internal/models/testcase.go | Adds checkpoint model + validation rules to task schema. |
| internal/models/testcase_test.go | Parsing + validation unit tests for checkpoints. |
| internal/models/outcome.go | Adds RunResult.Checkpoints and CheckpointOutcome result shape. |
| internal/models/outcome_schema_test.go | Updates expected default schemaVersion to 1.1. |
| internal/models/schema_version.go | Bumps CurrentSchemaVersion to 1.1 with inline rationale. |
| cmd/waza/cmd_migrate_test.go | Updates expected migrate messaging for current schemaVersion 1.1. |
| site/src/content/docs/reference/schema.mdx | Documents checkpoints in task schema and updates default schemaVersion to 1.1. |
| site/src/content/docs/guides/eval-yaml.mdx | Adds “Per-Turn Checkpoints” guide section and example. |
| site/src/content/docs/reference/schema-changes.md | Updates current-version table and adds a 1.1 changelog entry. |

Review details

  • Files reviewed: 12/12 changed files
  • Comments generated: 6
  • Review effort level: Low

Comment thread internal/orchestration/runner.go
Comment thread internal/orchestration/checkpoints.go
Comment thread internal/orchestration/checkpoints.go
Comment thread site/src/content/docs/reference/schema.mdx
Comment thread site/src/content/docs/guides/eval-yaml.mdx
Comment thread site/src/content/docs/reference/schema-changes.md
Copilot AI added 2 commits June 28, 2026 08:17
- Add Type field to synthesized _checkpoint_error GraderResults
- Fix docs to reference 'graders:' (the actual YAML key) instead of 'validators:'
- Update schema-changes.md Policy section to reflect current 1.1 default emission while preserving 1.0 reader fallback for back-compat

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 12:18
@spboyer spboyer merged commit ac7c35e into main Jun 28, 2026
11 checks passed
@spboyer spboyer deleted the spboyer-feat-per-turn-checkpoints branch June 28, 2026 12:22

Copilot AI left a comment


Review details

Comments suppressed due to low confidence (1)

internal/models/schema_version.go:28

  • defaultSchemaVersion currently defaults a missing schemaVersion to CurrentSchemaVersion (now 1.1). The PR description + schema policy docs say missing schemaVersion should default to 1.0 for backward compatibility, so this change alters the meaning of legacy artifacts (and also drives waza migrate output/test expectations).
func defaultSchemaVersion(version string) string {
	if strings.TrimSpace(version) == "" {
		return CurrentSchemaVersion
	}
	return version
}
  • Files reviewed: 12/13 changed files
  • Comments generated: 4
  • Review effort level: Low
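
For comparison with the suppressed comment above, here is a sketch of the reader-side behavior it describes, where a missing schemaVersion is read as 1.0 rather than as the current version. This is not code from the PR: `legacySchemaVersion` and `readerSchemaVersion` are hypothetical names, and whether the project should adopt this fallback is a design decision the thread leaves open.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// CurrentSchemaVersion is the version emitted for new artifacts.
	CurrentSchemaVersion = "1.1"
	// legacySchemaVersion is a hypothetical constant for artifacts that
	// predate the schemaVersion field.
	legacySchemaVersion = "1.0"
)

// readerSchemaVersion treats a blank or missing version as the legacy
// version, so older artifacts keep their original meaning when read.
func readerSchemaVersion(version string) string {
	if strings.TrimSpace(version) == "" {
		return legacySchemaVersion
	}
	return version
}

func main() {
	fmt.Println(readerSchemaVersion(""))    // missing field
	fmt.Println(readerSchemaVersion("1.1")) // explicit version is kept
}
```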

Comment on lines +580 to +591
```yaml
checkpoints:
  - after_turn: 1
    graders:
      - type: text
        contains: ["analyzing", "files"]
  - after_turn: 2
    on_failure: stop
    graders:
      - type: tool_calls
        required: ["read_file"]
```
Comment on lines +428 to +439
```yaml
checkpoints:
  - after_turn: 1
    graders:
      - type: text
        contains: ["analyzing", "files"]
  - after_turn: 2
    on_failure: stop # abort the run if this checkpoint fails
    graders:
      - type: tool_calls
        required: ["read_file"]
```
Comment on lines +207 to +213
// Status is StatusPassed when every grader in this checkpoint passed,
// StatusFailed when at least one grader failed.
Status Status `json:"status"`
// Validations maps grader identifier to result, identical to
// RunResult.Validations.
Validations map[string]GraderResults `json:"validations"`
// Stopped is true when this checkpoint had `on_failure: stop` and at
Comment on lines +1232 to +1237
// Surface checkpoint failures in the run status even when graders are
// skipped or when the final-pass graders all passed. A failed checkpoint
// without on_failure: stop should still mark the run as failed; a
// checkpoint that recorded StatusError (grader-execution error) should
// promote the run to StatusError so consumers can distinguish
// infrastructure problems from assertion failures.
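
The promotion rule described in the comment above can be sketched as a small function. This is an approximation under stated assumptions, not the code in runner.go: `promoteStatus` is a hypothetical name, the string values of the status constants are assumed, and the `resp.ErrorMsg` condition from the PR description is not modeled.

```go
package main

import "fmt"

// Status values; the constant names follow the PR, the strings are assumed.
type Status string

const (
	StatusPassed Status = "passed"
	StatusFailed Status = "failed"
	StatusError  Status = "error"
)

// CheckpointOutcome is a trimmed-down version of the result shape.
type CheckpointOutcome struct {
	AfterTurn int
	Status    Status
	Stopped   bool
}

// promoteStatus folds checkpoint outcomes into the final-pass status:
// any checkpoint error promotes the run to StatusError, and any checkpoint
// failure downgrades a passing run to StatusFailed.
func promoteStatus(final Status, cps []CheckpointOutcome) Status {
	result := final
	for _, cp := range cps {
		switch cp.Status {
		case StatusError:
			return StatusError
		case StatusFailed:
			if result == StatusPassed {
				result = StatusFailed
			}
		}
	}
	return result
}

func main() {
	cps := []CheckpointOutcome{
		{AfterTurn: 1, Status: StatusPassed},
		{AfterTurn: 2, Status: StatusFailed},
	}
	fmt.Println(promoteStatus(StatusPassed, cps))
}
```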

Development

Successfully merging this pull request may close these issues.

feat: Multi-turn conversation evaluation

3 participants