bug: prompt grader discards collected grades when follow-up turn fails after tool results #250

Description

@sebastienlevert

Summary

The prompt grader discards already-collected grades when the Copilot SDK's follow-up turn fails after tool results are sent back to the model. This causes evaluations to report score=0.00 and status=error even though the judge model successfully graded every criterion.

Reproduction

  1. Create an eval with a prompt grader using continue_session: true
  2. Run it with waza run eval.yaml --verbose --debug
  3. Observe in debug output:
    • Judge model responds with set_waza_grade_pass / set_waza_grade_fail tool calls (all collected successfully)
    • SDK sends tool results back to the model
    • Model starts a new assistant.turn_start (follow-up turn)
    • Follow-up turn fails: Failed to get response from the AI model; retried 5 times
    • SendAndWait returns error
    • gradeIndependent propagates the error, discarding the grades

Debug Event Timeline

T14:45:33  tool.execution_complete   <- All 5 grade tools completed successfully
T14:45:33  assistant.turn_end        <- Judge turn ended
T14:45:33  assistant.turn_start      <- SDK starts ANOTHER turn (sending tool results back)
T14:45:37  session.info              <- 5 retries, ~4s apart
T14:45:41  session.info
T14:45:44  session.info
T14:45:48  session.info
T14:45:52  session.info
T14:45:56  session.error             <- "Failed to get response from the AI model"

Root Cause

In prompt_grader.go:gradeIndependent(), the error check after SendAndWait does not inspect whether grade tool calls were already collected:

resp, err := session.SendAndWait(ctx, copilot.MessageOptions{...})
if err != nil {
    return nil, fmt.Errorf("failed to send prompt: %w", err)  // grades discarded
}

The Copilot SDK (copilot-sdk/go) unconditionally sends tool results back to the model via HandlePendingToolCall. The LLM API protocol requires the model to respond after receiving tool results, so the SDK starts another assistant turn. When that follow-up turn fails (rate limiting, context window limits, or a transient error), SendAndWait returns an error, even though the grade data is already in wazaTools.Passes and wazaTools.Failures.

Impact

  • Any prompt grader invocation whose follow-up turn fails loses its grades
  • Scores that should be 0.60, 0.80, or 1.00 are reported as 0.00
  • The pairwise grader has the same pattern (runPairwiseOnce)
  • Intermittent: the failure depends on whether the follow-up model call succeeds

Environment

  • waza v0.31.0
  • copilot-sdk/go v0.1.32
  • Windows 11, Copilot CLI 1.0.51-2
  • Models tested: claude-sonnet-4.5, claude-opus-4.6 (both exhibit the issue)
