Summary
The prompt grader discards already-collected grades when the Copilot SDK's follow-up turn fails after tool results are sent back to the model. This causes evaluations to report score=0.00 and status=error even though the judge model successfully graded every criterion.
Reproduction
- Create an eval with a prompt grader using continue_session: true
- Run it with waza run eval.yaml --verbose --debug
- Observe in debug output:
  - Judge model responds with set_waza_grade_pass / set_waza_grade_fail tool calls (all collected successfully)
  - SDK sends tool results back to the model
  - Model starts a new assistant.turn_start (follow-up turn)
  - Follow-up turn fails: Failed to get response from the AI model; retried 5 times
  - SendAndWait returns error
  - gradeIndependent propagates the error, discarding the grades
Debug Event Timeline
T14:45:33 tool.execution_complete <- All 5 grade tools completed successfully
T14:45:33 assistant.turn_end <- Judge turn ended
T14:45:33 assistant.turn_start <- SDK starts ANOTHER turn (sending tool results back)
T14:45:37 session.info <- 5 retries, ~4s apart
T14:45:41 session.info
T14:45:44 session.info
T14:45:48 session.info
T14:45:52 session.info
T14:45:56 session.error <- "Failed to get response from the AI model"
Root Cause
In prompt_grader.go:gradeIndependent(), the error check after SendAndWait does not inspect whether grade tool calls were already collected:
resp, err := session.SendAndWait(ctx, copilot.MessageOptions{...})
if err != nil {
	return nil, fmt.Errorf("failed to send prompt: %w", err) // grades discarded
}
The Copilot SDK (copilot-sdk/go) unconditionally sends tool results back to the model via HandlePendingToolCall. The LLM API protocol requires the model to respond after receiving tool results, so the SDK starts another assistant turn. When that follow-up turn fails (rate limiting, context window, transient error), SendAndWait returns an error — even though the grade data is already in wazaTools.Passes and wazaTools.Failures.
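One possible direction for a fix, sketched under assumptions: check whether the grade tools already recorded a verdict for every criterion before treating the SendAndWait error as fatal. The names gradeCollector, sendFunc, gradeTolerant, and errFollowUp below are hypothetical stand-ins for wazaTools, session.SendAndWait, and gradeIndependent. They are not the actual waza or copilot-sdk/go types, and the real fix may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// errFollowUp simulates the error returned when the follow-up turn fails.
var errFollowUp = errors.New("follow-up turn failed")

// gradeCollector is a hypothetical stand-in for wazaTools: it records the
// criteria the judge marked as passed or failed through tool calls.
type gradeCollector struct {
	Passes   []string
	Failures []string
}

// collected reports how many grades have been recorded so far.
func (g *gradeCollector) collected() int { return len(g.Passes) + len(g.Failures) }

// sendFunc is a hypothetical stand-in for session.SendAndWait. It runs the
// judge turn (which may invoke the grade tools) and returns any session
// error, including one raised by the follow-up turn.
type sendFunc func(tools *gradeCollector) error

// gradeTolerant keeps grades that were fully collected before the send
// failed. expected is the number of criteria the judge was asked to grade.
func gradeTolerant(send sendFunc, expected int) (*gradeCollector, error) {
	tools := &gradeCollector{}
	err := send(tools)
	if err != nil && tools.collected() < expected {
		// Grades are missing or incomplete, so the error is genuine.
		return nil, fmt.Errorf("failed to send prompt: %w", err)
	}
	// Either there was no error, or it arrived after every criterion was graded.
	return tools, nil
}

func main() {
	// Mirrors the timeline: all five grades land, then the follow-up turn fails.
	failingFollowUp := func(tools *gradeCollector) error {
		tools.Passes = append(tools.Passes, "c1", "c2", "c3")
		tools.Failures = append(tools.Failures, "c4", "c5")
		return errFollowUp
	}
	grades, err := gradeTolerant(failingFollowUp, 5)
	fmt.Println(len(grades.Passes), len(grades.Failures), err) // prints "3 2 <nil>"
}
```

Comparing against the expected criterion count, rather than checking for any grade at all, avoids reporting a partial score when the session fails midway through grading. A real change would probably also log the swallowed error as a warning.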
Impact
- Every prompt grader invocation is affected when the follow-up turn fails
- Scores that should be 0.60, 0.80, 1.00 are reported as 0.00
- The pairwise grader has the same pattern (runPairwiseOnce)
- Intermittent — depends on whether the follow-up model call succeeds
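To make the score gap concrete, here is a small illustration. It assumes the prompt grader's score is the fraction of criteria that passed, which is consistent with the five-criterion run in the timeline but is not confirmed from the waza source. The score function is hypothetical.

```go
package main

import "fmt"

// score assumes the grader reports the fraction of criteria that passed.
// This formula is an assumption for illustration, not taken from waza.
func score(passes, failures int) float64 {
	total := passes + failures
	if total == 0 {
		return 0 // no grades available, as when they are discarded
	}
	return float64(passes) / float64(total)
}

func main() {
	// With five criteria, 3, 4, and 5 passes give 0.60, 0.80, and 1.00.
	for _, p := range []int{3, 4, 5} {
		fmt.Printf("grades kept: %.2f  grades discarded: %.2f\n", score(p, 5-p), score(0, 0))
	}
}
```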
Environment
- waza v0.31.0
- copilot-sdk/go v0.1.32
- Windows 11, Copilot CLI 1.0.51-2
- Models tested: claude-sonnet-4.5, claude-opus-4.6 (both exhibit the issue)