
prompt grader false-fails / wedges on 'waiting for session.idle: context deadline exceeded' for long multi-turn judge sessions #318

Description

@sebastienlevert

Summary

When a prompt grader evaluates a long or heavily loaded judge session, Session.SendAndWait can exhaust its deadline waiting for the session.idle event and return the error "waiting for session.idle: context deadline exceeded". waza surfaces it as:

running graders: failed to run grader rubric_judge: failed to send prompt: waiting for session.idle: context deadline exceeded

Two bad outcomes:

  1. False failure when no grade tool calls have fired yet (total == 0): the task is recorded as a hard grading error even though the agent run itself was fine and gradeable.
  2. Wedge + wasted wall-clock when grades were collected (total > 0): waza correctly keeps the grades, but the underlying copilot session never goes idle, so the process hangs during teardown until an external watchdog reaps it.

Where it happens (call chain)

internal/graders/prompt_grader.go  executePromptGrader
  └─ context.WithTimeout(ctx, promptGraderTimeout)         // hardcoded 120s (prompt_grader.go)
     └─ internal/execution/copilot.go  CopilotEngine.Execute
        └─ session.SendAndWait(ctx)                         // copilot.go:467
           └─ github.com/github/copilot-sdk/go session.go  // v1.0.0 SendAndWait
              select {
              case <-idleCh:     return result, nil         // got session.idle
              case err := <-errCh: return nil, err
              case <-ctx.Done(): // TODO: remove once session.Send honors the context
                 return nil, fmt.Errorf("waiting for session.idle: %w", ctx.Err())
              }

SendAndWait detects completion only via a session.idle event. The SDK itself flags the gap with // TODO: remove once session.Send honors the context, and its own doc notes the wait "does not abort in-flight agent work." So when the CLI session doesn't emit session.idle in time, the only exit is the context deadline — and the in-flight turn keeps running (orphaned), which is what produces the long teardown hang.

Contributing factors

  1. Unnecessary post-grade follow-up turn. waza already documents this in prompt_grader.go: "The SDK unconditionally sends tool results back to the model after the grade tool calls fire, which starts a follow-up assistant turn." For grading, that follow-up turn is useless, but it keeps the session busy and is the turn that tends not to reach session.idle, which makes it the precise trigger for the post-grading wedge.
  2. Hardcoded 120s grading budget (promptGraderTimeout). The agent execution gets the full suite timeout (often hours), but the judge gets only 120s — too tight for a cheap model evaluating a long multi-turn transcript.

Impact

  • Passing evals reported as failures (depresses pass-rate on release-gate suites).
  • Tens of minutes of wasted wall-clock per wedge.
  • Orphaned copilot subprocesses that need external reaping.

Observed downstream on multi-turn "golden" lifecycle suites: the wedge appears only on long multi-turn judge runs (12/12 occurrences in multi-turn suites, 0 in single/short-turn suites). This is consistent with the hypothesis that the longer a session runs, the less reliably it reaches session.idle for the judge prompt.

Proposed fixes

  • waza (small, this issue's PR): make promptGraderTimeout configurable via WAZA_PROMPT_GRADER_TIMEOUT (default unchanged) so operators can extend the budget for heavy judge sessions. An escape hatch, not a root-cause fix.
  • waza (follow-up): consider returning as soon as grades are collected rather than waiting out the wasteful follow-up turn for session.idle; consider a more generous / execution-scaled default.
  • copilot SDK (root cause): resolve the SendAndWait // TODO — honor context cancellation and abort/clean up in-flight work so a stuck session returns promptly without orphaning a subprocess.

Repro

Run a long multi-turn eval that uses a prompt (LLM-judge) grader against the live copilot engine (e.g. a lifecycle suite graded by a cheap judge model). Observe the grader fail with failed to send prompt: … waiting for session.idle: context deadline exceeded, and the process hang during teardown before it is reaped.

Environment

  • waza main; copilot-sdk go v1.0.0.
  • Reproduced on Windows with a multi-turn judge suite; the mechanism is platform-independent (it's a session-lifecycle wait).
