fix: prompt grader gracefully recovers when follow-up turn fails after grades collected#251

Merged
github-actions[bot] merged 1 commit into
microsoft:mainfrom
sebastienlevert:fix/prompt-grader-graceful-recovery
May 20, 2026

@sebastienlevert

Contributor

Summary

Fixes #250 — the prompt grader now gracefully recovers when the Copilot SDK's follow-up turn fails after grade tool results are sent back to the model.

Problem

SendAndWait returns an error when the model fails to respond to the follow-up turn after set_waza_grade_pass/set_waza_grade_fail tool results are sent back. The error was propagated unconditionally, discarding the already-collected grades. This caused evaluations to report score=0.00 and status=error even though the judge successfully graded every criterion.

Fix

After SendAndWait returns an error, check whether grade tool calls were already collected in wazaTools.Passes/wazaTools.Failures. If they were, log a warning and continue with the collected scores instead of failing. The fix also guards against a nil resp, since SendAndWait returns (nil, error) on failure.

Changes

  • internal/graders/prompt_grader.go:
    • Check len(wazaTools.Passes) + len(wazaTools.Failures) > 0 before propagating error
    • Guard resp.Data.Content access for nil resp
    • Log warning with pass/fail counts for observability

Testing

Verified on Windows 11 with waza v0.31.0 (built from source with this patch):

Before patch: score=0.00, status=error — grades discarded

[ERROR] running graders: failed to run grader rubric_judge: failed to send prompt: 
session error: Failed to get response from the AI model; retried 5 times

After patch: score=0.60 — grades recovered (3 pass, 2 fail)

WARN "prompt grader: ignoring post-grade session error (grades already collected)" passes=3 failures=2
[GRADER] rubric_judge score=0.60 (39.835s)

Note

The pairwise grader (runPairwiseOnce) has the same SendAndWait + error pattern and would benefit from the same fix. I kept this PR focused on the gradeIndependent path where I could reproduce and verify the fix.

…r grades collected

The Copilot SDK unconditionally sends tool results back to the model
after set_waza_grade_pass/fail tool calls, starting a follow-up
assistant turn. When that turn fails ('Failed to get response from
the AI model'), SendAndWait returns an error — but the grades were
already collected in wazaTools.Passes/Failures.

Before this fix, the error was propagated and all grade data was
discarded (score=0.00, status=error). Now, if grades were already
collected, the post-grade session error is logged as a warning and
the actual scores are returned.

Also handles the nil resp case when recovering from the error, since
SendAndWait returns (nil, error).

Fixes microsoft#250
@sebastienlevert sebastienlevert requested a review from spboyer as a code owner May 20, 2026 19:02
@github-actions github-actions Bot enabled auto-merge (squash) May 20, 2026 19:02
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@63c4908). Learn more about missing BASE report.

Files with missing lines Patch % Lines
internal/graders/prompt_grader.go 0.00% 8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #251   +/-   ##
=======================================
  Coverage        ?   75.69%           
=======================================
  Files           ?      152           
  Lines           ?    17627           
  Branches        ?        0           
=======================================
  Hits            ?    13342           
  Misses          ?     3356           
  Partials        ?      929           
Flag Coverage Δ
go-implementation 75.69% <0.00%> (?)

@github-actions github-actions Bot merged commit e136ce3 into microsoft:main May 20, 2026
8 checks passed

Development

Successfully merging this pull request may close these issues.

bug: prompt grader discards collected grades when follow-up turn fails after tool results

2 participants