Summary
When a prompt grader evaluates a long or heavily loaded judge session, Session.SendAndWait can exhaust its deadline waiting for the session.idle event and return the error "waiting for session.idle: context deadline exceeded". waza surfaces it as:
running graders: failed to run grader rubric_judge: failed to send prompt: waiting for session.idle: context deadline exceeded
Two bad outcomes:
- False failure when no grade tool calls have fired yet (total == 0): the task is recorded as a hard grading error even though the agent run itself was fine and gradeable.
- Wedge + wasted wall-clock when grades were collected (total > 0): waza correctly keeps the grades, but the underlying copilot session never goes idle, so the process hangs for a long time during teardown before an external watchdog has to reap it.
Where it happens (call chain)
internal/graders/prompt_grader.go  executePromptGrader
  └─ context.WithTimeout(ctx, promptGraderTimeout)  // hardcoded 120s (prompt_grader.go)
  └─ internal/execution/copilot.go  CopilotEngine.Execute
       └─ session.SendAndWait(ctx)  // copilot.go:467
            └─ github.com/github/copilot-sdk/go  session.go  // v1.0.0 SendAndWait
                 select {
                 case <-idleCh: return result, nil  // got session.idle
                 case err := <-errCh: return nil, err
                 case <-ctx.Done():  // TODO: remove once session.Send honors the context
                     return nil, fmt.Errorf("waiting for session.idle: %w", ctx.Err())
                 }
SendAndWait detects completion only via a session.idle event. The SDK itself flags the gap with // TODO: remove once session.Send honors the context, and its own doc notes the wait "does not abort in-flight agent work." So when the CLI session doesn't emit session.idle in time, the only exit is the context deadline — and the in-flight turn keeps running (orphaned), which is what produces the long teardown hang.
Contributing factors
- Unnecessary post-grade follow-up turn. waza already documents this in prompt_grader.go: "The SDK unconditionally sends tool results back to the model after the grade tool calls fire, which starts a follow-up assistant turn." For grading that follow-up turn is useless, but it keeps the session busy and is the thing that tends not to reach session.idle, which is the precise trigger for the post-grading wedge.
- Hardcoded 120s grading budget (promptGraderTimeout). The agent execution gets the full suite timeout (often hours), but the judge gets only 120s, which is too tight for a cheap model evaluating a long multi-turn transcript.
Impact
- Passing evals reported as failures (depresses pass-rate on release-gate suites).
- Tens of minutes of wasted wall-clock per wedge.
- Orphaned copilot subprocesses that need external reaping.
Observed downstream on multi-turn "golden" lifecycle suites: the wedge appears only on long multi-turn judge runs (12/12 occurrences in multi-turn suites, 0 in single/short-turn suites). This is consistent with the hypothesis that the longer a session runs, the less reliably it reaches session.idle for the judge prompt.
Proposed fixes
- waza (small, this issue's PR): make promptGraderTimeout configurable via WAZA_PROMPT_GRADER_TIMEOUT (default unchanged) so operators can extend the budget for heavy judge sessions. An escape hatch, not a root-cause fix.
- waza (follow-up): consider returning as soon as grades are collected rather than waiting out the wasteful follow-up turn for session.idle; consider a more generous / execution-scaled default.
- copilot SDK (root cause): resolve the SendAndWait // TODO: honor context cancellation and abort/clean up in-flight work so a stuck session returns promptly without orphaning a subprocess.
Repro
Run a long multi-turn eval that uses a prompt (LLM-judge) grader against the live copilot engine (e.g. a lifecycle suite graded by a cheap judge model). Observe the grader fail with failed to send prompt: … waiting for session.idle: context deadline exceeded, and the process hang during teardown before it is reaped.
Environment
- waza main; copilot-sdk go v1.0.0.
- Reproduced on Windows with a multi-turn judge suite; the mechanism is platform-independent (it's a session-lifecycle wait).