feat: Multi-turn conversation evaluation

## Problem

Waza already supports multi-turn flows via `inputs.follow_up_prompts` and the responder model. The real gap is that **per-turn assertions** aren't first-class:

- Graders run once at the end of the conversation.
- There's no way to express "after turn 2, the agent should have called tool X" or "before turn 3, no PII should have been emitted."
- Failures in early turns get masked by later recovery; conversely, a correct final answer can hide a broken middle step.

So authors can compose multi-turn inputs, but can't grade *the conversation* — only its terminal state.

## Proposal

Add **per-turn checkpoints** layered on top of the existing multi-turn model. No breaking change to `inputs.prompt` or `inputs.follow_up_prompts`.

```yaml
inputs:
  prompt: "..."
  follow_up_prompts: ["..."]
checkpoints:
  - after_turn: 1
    graders:
      - type: tool_calls
        must_include: ["search_docs"]
  - after_turn: 2
    graders:
      - type: behavior
        forbidden: ["leaked credentials"]
```

- Each checkpoint runs the listed graders against the state *at that turn boundary*.
- Existing top-level `graders:` continues to run against the final state.
- `results.json` records per-turn outcomes (pass/fail per checkpoint).
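
To make the intended semantics concrete, here is a minimal Python sketch of how a runner might evaluate checkpoints at turn boundaries. Everything in it is hypothetical: `ConversationState`, `Checkpoint`, `run_with_checkpoints`, and the two stand-in graders are illustrative names, not Waza's actual types or API, and the result shape is only one possible layout for the per-turn outcomes. The sketch also assumes turns are 1-indexed and that the run continues after a failed checkpoint, which this issue leaves open.

```python
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical types for illustration; none of these names come from Waza.
@dataclass
class ConversationState:
    turns_completed: int = 0
    tool_calls: list[str] = field(default_factory=list)
    transcript: list[str] = field(default_factory=list)


Grader = Callable[[ConversationState], bool]


@dataclass
class Checkpoint:
    after_turn: int
    graders: list[Grader]


def must_include_tools(required: list[str]) -> Grader:
    """Stand-in for a `tool_calls` grader with `must_include`."""
    return lambda state: all(t in state.tool_calls for t in required)


def forbid_phrases(forbidden: list[str]) -> Grader:
    """Crude substring stand-in for a `behavior` grader with `forbidden`."""
    return lambda state: not any(
        p in line for line in state.transcript for p in forbidden
    )


def run_with_checkpoints(turns, checkpoints, final_graders):
    """Play each turn, grade matching checkpoints, then grade the final state."""
    state = ConversationState()
    results = {"checkpoints": [], "final": None}
    for i, play_turn in enumerate(turns, start=1):
        play_turn(state)  # each turn mutates the shared state
        state.turns_completed = i
        for cp in checkpoints:
            if cp.after_turn == i:
                # Graders see the state as of this turn boundary only.
                passed = all(g(state) for g in cp.graders)
                results["checkpoints"].append({"after_turn": i, "passed": passed})
    # Top-level graders still run once, against the final state.
    results["final"] = all(g(state) for g in final_graders)
    return results


# Demo: turn 1 calls the expected tool; turn 2 emits a forbidden phrase.
def turn_1(state):
    state.tool_calls.append("search_docs")
    state.transcript.append("Here is what the docs say.")


def turn_2(state):
    state.transcript.append("Debug dump: leaked credentials follow.")


results = run_with_checkpoints(
    turns=[turn_1, turn_2],
    checkpoints=[
        Checkpoint(after_turn=1, graders=[must_include_tools(["search_docs"])]),
        Checkpoint(after_turn=2, graders=[forbid_phrases(["leaked credentials"])]),
    ],
    final_graders=[lambda state: state.turns_completed == 2],
)
```

In this demo the final grader passes while the turn-2 checkpoint fails, which is the kind of mid-conversation failure that end-of-run grading alone would not surface.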

## Why this matters for agentic-first

Agentic skills make decisions over time. A skill that gets the right answer the wrong way (wrong tool, wrong order, wrong intermediate state) is still buggy. Per-turn checkpoints let authors lock down *how* the agent got there.

## Acceptance criteria

- [ ] `checkpoints:` field added to task schema; backward compatible (existing tasks with only top-level `graders` keep working).
- [ ] Per-turn checkpoint results surfaced in `results.json` and dashboard.
- [ ] Reuses existing grader implementations (no new grader types required).
- [ ] Tests cover: short-circuit vs. continue behavior on checkpoint failure, mixed checkpoint + final graders, schema validation.
- [ ] Docs updated in `site/` with an example.
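
As a rough illustration of the schema-validation and backward-compatibility criteria, the sketch below checks a task that has already been parsed into a Python dict. `validate_checkpoints` is a hypothetical helper, not part of Waza. It assumes the total turn count is one (for `inputs.prompt`) plus the number of `follow_up_prompts`, and that `after_turn` is 1-indexed; both are assumptions that match the example in the proposal but are not confirmed by it.

```python
def validate_checkpoints(task: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the task is valid.

    Hypothetical helper: the rules below are assumptions, not Waza's schema.
    """
    errors: list[str] = []
    inputs = task.get("inputs", {})
    # Assumption: one turn for `prompt`, plus one per follow-up prompt.
    total_turns = 1 + len(inputs.get("follow_up_prompts", []))

    # A missing `checkpoints` key is valid, so existing tasks keep working.
    checkpoints = task.get("checkpoints", [])
    if not isinstance(checkpoints, list):
        return ["checkpoints must be a list"]

    for idx, cp in enumerate(checkpoints):
        where = f"checkpoints[{idx}]"
        if not isinstance(cp, dict):
            errors.append(f"{where}: must be a mapping")
            continue
        after = cp.get("after_turn")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(after, int) or isinstance(after, bool) or after < 1:
            errors.append(f"{where}: after_turn must be a positive integer")
        elif after > total_turns:
            errors.append(
                f"{where}: after_turn {after} exceeds {total_turns} turn(s)"
            )
        graders = cp.get("graders")
        if not isinstance(graders, list) or not graders:
            errors.append(f"{where}: graders must be a non-empty list")
    return errors
```

A task with only top-level `graders` yields no errors under this sketch, while a checkpoint that points past the last turn is rejected before the run starts.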

## Non-goals (filed separately)

- Cross-engine session/event normalization — depends on #10.
- Scripted assistant replies for deterministic dialog testing — see #367 (snapshot/replay).

## Related

- Existing: `inputs.follow_up_prompts`, responder model
- Roadmap: #66
- Engine event normalization: #10
- Snapshot/replay for deterministic dialogs: #367

