feat: Multi-turn conversation evaluation #358

Description

@spboyer

Problem

Waza already supports multi-turn flows via inputs.follow_up_prompts and the responder model. The real gap is that per-turn assertions aren't first-class:

  • Graders run once at the end of the conversation.
  • There's no way to express "after turn 2, the agent should have called tool X" or "before turn 3, no PII should have been emitted."
  • Failures in early turns get masked by later recovery; conversely, a correct final answer can hide a broken middle step.

So authors can compose multi-turn inputs, but can't grade the conversation — only its terminal state.
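To make the masking problem concrete, here is a minimal Python sketch. It is not Waza code: the `Turn` type, both grading functions, and the sample conversation are hypothetical stand-ins chosen only to show how a terminal-state check can pass while a turn-level check fails.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One agent turn: the tools it called and the text it produced."""
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""


def grade_final_only(turns: list[Turn], expected: str) -> bool:
    """Terminal-state grading: look at the last turn's output only."""
    return expected in turns[-1].output


def grade_turn_tool(turns: list[Turn], turn: int, tool: str) -> bool:
    """Per-turn grading: did the agent call `tool` during the 1-based `turn`?"""
    return tool in turns[turn - 1].tool_calls


# Turn 1 skips the search and guesses; turn 2 recovers after the follow-up.
conversation = [
    Turn(tool_calls=[], output="I think the limit is 10 requests per minute."),
    Turn(tool_calls=["search_docs"],
         output="Correction: the limit is 60 requests per minute."),
]

final_ok = grade_final_only(conversation, "60 requests per minute")
turn1_ok = grade_turn_tool(conversation, turn=1, tool="search_docs")
print(final_ok, turn1_ok)  # True False
```

The final-state grader reports success even though the agent skipped the required tool call in turn 1, which is the gap described above.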

Proposal

Add per-turn checkpoints layered on top of the existing multi-turn model. No breaking change to inputs.prompt or inputs.follow_up_prompts.

inputs:
  prompt: "..."
  follow_up_prompts: ["..."]
checkpoints:
  - after_turn: 1
    graders:
      - type: tool_calls
        must_include: ["search_docs"]
  - after_turn: 2
    graders:
      - type: behavior
        forbidden: ["leaked credentials"]

  • Each checkpoint runs the listed graders against the state at that turn boundary.
  • The existing top-level graders: field continues to run against the final state.
  • results.json records per-turn outcomes (pass/fail per checkpoint).
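One way the checkpoint semantics could work is sketched below in Python, purely as an illustration and not as Waza's implementation. Several things here are assumptions: the state at a turn boundary is taken to be all turns up to and including `after_turn`; the two grader factories are crude stand-ins for the `tool_calls` and `behavior` graders (the real `behavior` grader is unlikely to be a substring match); the `stop_on_failure` flag and the shape of the returned results are hypothetical; and the turns are pre-recorded rather than produced live.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Turn:
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""


# A grader receives the turns seen so far and returns pass/fail.
Grader = Callable[[list[Turn]], bool]


@dataclass
class Checkpoint:
    after_turn: int          # 1-based turn boundary
    graders: list[Grader]


def tool_calls_must_include(required: list[str]) -> Grader:
    """Stand-in for `type: tool_calls`: every required tool was called so far."""
    def grade(turns: list[Turn]) -> bool:
        called = {name for t in turns for name in t.tool_calls}
        return all(name in called for name in required)
    return grade


def output_forbids(phrases: list[str]) -> Grader:
    """Substring stand-in for `type: behavior` with `forbidden` (an assumption)."""
    def grade(turns: list[Turn]) -> bool:
        text = " ".join(t.output.lower() for t in turns)
        return not any(p.lower() in text for p in phrases)
    return grade


def evaluate(turns: list[Turn], checkpoints: list[Checkpoint],
             final_graders: list[Grader], stop_on_failure: bool = False) -> dict:
    """Run checkpoint graders at each boundary, then the top-level graders."""
    results: dict = {"checkpoints": [], "final": None}
    for cp in sorted(checkpoints, key=lambda c: c.after_turn):
        state = turns[: cp.after_turn]  # assumed: cumulative state through this turn
        passed = all(g(state) for g in cp.graders)
        results["checkpoints"].append({"after_turn": cp.after_turn, "passed": passed})
        if not passed and stop_on_failure:
            return results  # short-circuit: final graders never run
    results["final"] = {"passed": all(g(turns) for g in final_graders)}
    return results


turns = [
    Turn(tool_calls=["search_docs"], output="Found the setup guide."),
    Turn(tool_calls=[], output="Debug dump with leaked credentials: user=admin"),
]
checkpoints = [
    Checkpoint(after_turn=1, graders=[tool_calls_must_include(["search_docs"])]),
    Checkpoint(after_turn=2, graders=[output_forbids(["leaked credentials"])]),
]
final_graders = [tool_calls_must_include(["search_docs"])]

report = evaluate(turns, checkpoints, final_graders)
halted = evaluate(turns, checkpoints, final_graders, stop_on_failure=True)
```

In this run the turn-1 checkpoint passes, the turn-2 checkpoint fails, and the final graders still pass, so `report` records a failure that final-only grading would hide. With `stop_on_failure=True` the final graders are skipped and `halted["final"]` stays `None`.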

Why this matters for agentic-first

Agentic skills make decisions over time. A skill that gets the right answer the wrong way (wrong tool, wrong order, wrong intermediate state) is still buggy. Per-turn checkpoints let authors lock down how the agent got there.

Acceptance criteria

  • checkpoints: field added to task schema; backward compatible (existing tasks with only top-level graders keep working).
  • Per-turn checkpoint results surfaced in results.json and dashboard.
  • Reuses existing grader implementations (no new grader types required).
  • Tests cover: checkpoint failure short-circuit vs. continue, mixed checkpoint + final graders, schema validation.
  • Docs updated in site/ with an example.
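The schema-validation and backward-compatibility criteria could be exercised with a check along these lines. This is a hedged Python sketch, not the project's validator: the function name `validate_checkpoints`, the error strings, and the specific rules (the turn count is `1 + len(follow_up_prompts)`, `after_turn` values must be unique, `graders` must be non-empty) are all assumptions that the issue does not specify.

```python
def validate_checkpoints(task: dict) -> list[str]:
    """Return human-readable schema errors; an empty list means valid."""
    errors: list[str] = []
    checkpoints = task.get("checkpoints")
    if checkpoints is None:
        return errors  # backward compatible: the field is optional
    if not isinstance(checkpoints, list):
        return ["checkpoints must be a list"]

    inputs = task.get("inputs") or {}
    # Assumed: one turn for `prompt`, plus one per follow-up prompt.
    total_turns = 1 + len(inputs.get("follow_up_prompts") or [])

    seen: set[int] = set()
    for i, cp in enumerate(checkpoints):
        where = f"checkpoints[{i}]"
        if not isinstance(cp, dict):
            errors.append(f"{where} must be a mapping")
            continue
        turn = cp.get("after_turn")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(turn, int) or isinstance(turn, bool):
            errors.append(f"{where}.after_turn must be an integer")
        elif not 1 <= turn <= total_turns:
            errors.append(f"{where}.after_turn must be between 1 and {total_turns}")
        elif turn in seen:
            errors.append(f"{where}.after_turn duplicates turn {turn}")
        else:
            seen.add(turn)
        graders = cp.get("graders")
        if not isinstance(graders, list) or not graders:
            errors.append(f"{where}.graders must be a non-empty list")
    return errors


# A task with only top-level graders stays valid.
legacy_task = {"inputs": {"prompt": "..."}, "graders": [{"type": "behavior"}]}

# Two turns exist, so after_turn: 3 is out of range; the second entry has no graders.
bad_task = {
    "inputs": {"prompt": "...", "follow_up_prompts": ["..."]},
    "checkpoints": [
        {"after_turn": 3, "graders": [{"type": "tool_calls"}]},
        {"after_turn": 1, "graders": []},
    ],
}

legacy_errors = validate_checkpoints(legacy_task)
bad_errors = validate_checkpoints(bad_task)
```

Returning a list of errors rather than raising on the first one lets a test assert on every violation in a task at once.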

Non-goals (filed separately)

Related

Metadata

Assignees

No one assigned

    Labels

    agentic-first
    coding-agent: Good candidate for coding-agent implementation
    enhancement: New feature or request
    go:yes: Ready to implement
    release:backlog: Not yet targeted
    squad:copilot: Assigned to @copilot (Coding Agent) for autonomous work
    type:feature: New capability
