Problem
Waza already supports multi-turn flows via inputs.follow_up_prompts and the responder model. The real gap is that per-turn assertions aren't first class:
- Graders run once at the end of the conversation.
- There's no way to express "after turn 2, the agent should have called tool X" or "before turn 3, no PII should have been emitted."
- Failures in early turns get masked by later recovery; conversely, a correct final answer can hide a broken middle step.
So authors can compose multi-turn inputs, but can't grade the conversation — only its terminal state.
Proposal
Add per-turn checkpoints layered on top of the existing multi-turn model. No breaking change to inputs.prompt or inputs.follow_up_prompts.
inputs:
  prompt: "..."
  follow_up_prompts: ["..."]
checkpoints:
  - after_turn: 1
    graders:
      - type: tool_calls
        must_include: ["search_docs"]
  - after_turn: 2
    graders:
      - type: behavior
        forbidden: ["leaked credentials"]
- Each checkpoint runs the listed graders against the state at that turn boundary.
- Existing top-level graders: continues to run against the final state.
- results.json records per-turn outcomes (pass/fail per checkpoint).
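The checkpoint semantics above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Waza's implementation: the turn shape, the grader factories, and run_checkpoints are hypothetical names. "Turn N" is assumed to mean the Nth agent response, and the behavior grader is replaced by a plain substring check. A task with no checkpoints yields an empty result list, which mirrors the backward-compatible default.

```python
from typing import Callable

# Hypothetical shapes for illustration only; not Waza's real types.
# A turn is {"response": str, "tool_calls": [str, ...]}.
Grader = Callable[[list], bool]  # receives every turn up to the boundary


def tool_calls_grader(must_include: list) -> Grader:
    """Pass if every named tool was called in the turns seen so far."""
    def grade(turns: list) -> bool:
        called = {name for t in turns for name in t["tool_calls"]}
        return all(name in called for name in must_include)
    return grade


def forbidden_text_grader(forbidden: list) -> Grader:
    """Stand-in for the behavior grader: a plain substring check."""
    def grade(turns: list) -> bool:
        text = " ".join(t["response"] for t in turns).lower()
        return not any(phrase.lower() in text for phrase in forbidden)
    return grade


def run_checkpoints(turns: list, checkpoints: list) -> list:
    """Grade the conversation prefix at each after_turn boundary."""
    results = []
    for cp in checkpoints:
        prefix = turns[: cp["after_turn"]]  # state at the turn boundary
        passed = all(grader(prefix) for grader in cp["graders"])
        results.append({"after_turn": cp["after_turn"], "passed": passed})
    return results


# The agent skips the tool on turn 1 and recovers on turn 2.
turns = [
    {"response": "Let me answer from memory.", "tool_calls": []},
    {"response": "Here is the doc summary.", "tool_calls": ["search_docs"]},
]
checkpoints = [
    {"after_turn": 1, "graders": [tool_calls_grader(["search_docs"])]},
    {"after_turn": 2, "graders": [forbidden_text_grader(["leaked credentials"])]},
]

per_turn = run_checkpoints(turns, checkpoints)
final_ok = tool_calls_grader(["search_docs"])(turns)  # final-state grading only
print(per_turn, final_ok)
```

In this run the final-state grader passes because the tool was eventually called, while the checkpoint after turn 1 fails. That is the kind of early failure that terminal-state grading masks.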
Why this matters for agentic-first
Agentic skills make decisions over time. A skill that gets the right answer the wrong way (wrong tool, wrong order, wrong intermediate state) is still buggy. Per-turn checkpoints let authors lock down how the agent got there.
Acceptance criteria
- checkpoints: field added to task schema; backward compatible (existing tasks with only top-level graders keep working).
- Per-turn outcomes surfaced in results.json and dashboard.
- Documented in site/ with an example.
Non-goals (filed separately)
Related
inputs.follow_up_prompts, responder model