prompt grader false-fails / wedges on 'waiting for session.idle: context deadline exceeded' for long multi-turn judge sessions

## Summary

When a `prompt` grader evaluates a long / heavily-loaded judge session, `Session.SendAndWait` can exhaust its deadline waiting for the `session.idle` event and return `waiting for session.idle: context deadline exceeded`. waza surfaces it as:

```
running graders: failed to run grader rubric_judge: failed to send prompt: waiting for session.idle: context deadline exceeded
```

Two bad outcomes:

1. **False failure** when no grade tool calls have fired yet (`total == 0`): the task is recorded as a hard grading error even though the agent run itself was fine and gradeable.
2. **Wedge + wasted wall-clock** when grades *were* collected (`total > 0`): waza correctly keeps the grades, but the underlying copilot session never goes idle, so the process hangs for a long time during teardown before an external watchdog has to reap it.

## Where it happens (call chain)

```
internal/graders/prompt_grader.go  executePromptGrader
  └─ context.WithTimeout(ctx, promptGraderTimeout)         // hardcoded 120s (prompt_grader.go)
     └─ internal/execution/copilot.go  CopilotEngine.Execute
        └─ session.SendAndWait(ctx)                         // copilot.go:467
           └─ github.com/github/copilot-sdk/go session.go  // v1.0.0 SendAndWait
              select {
              case <-idleCh:     return result, nil         // got session.idle
              case err := <-errCh: return nil, err
              case <-ctx.Done(): // TODO: remove once session.Send honors the context
                 return nil, fmt.Errorf("waiting for session.idle: %w", ctx.Err())
              }
```

`SendAndWait` detects completion **only** via a `session.idle` event. The SDK itself flags the gap with `// TODO: remove once session.Send honors the context`, and its own doc notes the wait *"does not abort in-flight agent work."* So when the CLI session doesn't emit `session.idle` in time, the only exit is the context deadline — and the in-flight turn keeps running (orphaned), which is what produces the long teardown hang.

## Contributing factors

1. **Unnecessary post-grade follow-up turn.** waza already documents this in `prompt_grader.go`: *"The SDK unconditionally sends tool results back to the model after the grade tool calls fire, which starts a follow-up assistant turn."* For grading that follow-up turn is useless, but it keeps the session busy and is the thing that tends not to reach `session.idle` — the precise trigger for the post-grading wedge.
2. **Hardcoded 120s grading budget** (`promptGraderTimeout`). The agent execution gets the full suite timeout (often hours), but the judge gets only 120s — too tight for a cheap model evaluating a long multi-turn transcript.

## Impact

- Passing evals reported as failures (depresses pass-rate on release-gate suites).
- Tens of minutes of wasted wall-clock per wedge.
- Orphaned copilot subprocesses that need external reaping.

Observed downstream on multi-turn "golden" lifecycle suites: the wedge appears **only** on long multi-turn judge runs (12/12 occurrences in multi-turn suites, 0 in single/short-turn suites), consistent with the longer a session runs, the less reliably it reaches `session.idle` for the judge prompt.

## Proposed fixes

- **waza (small, this issue's PR):** make `promptGraderTimeout` configurable via `WAZA_PROMPT_GRADER_TIMEOUT` (default unchanged) so operators can extend the budget for heavy judge sessions. An escape hatch, not a root-cause fix.
- **waza (follow-up):** consider returning as soon as grades are collected rather than waiting out the wasteful follow-up turn for `session.idle`; consider a more generous / execution-scaled default.
- **copilot SDK (root cause):** resolve the `SendAndWait` `// TODO` — honor context cancellation and abort/clean up in-flight work so a stuck session returns promptly without orphaning a subprocess.

## Repro

Run a long multi-turn eval that uses a `prompt` (LLM-judge) grader against the live copilot engine (e.g. a lifecycle suite graded by a cheap judge model). Observe the grader fail with `failed to send prompt: … waiting for session.idle: context deadline exceeded`, and the process hang during teardown before it is reaped.

## Environment

- waza `main`; copilot-sdk `go v1.0.0`.
- Reproduced on Windows with a multi-turn judge suite; the mechanism is platform-independent (it's a session-lifecycle wait).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

prompt grader false-fails / wedges on 'waiting for session.idle: context deadline exceeded' for long multi-turn judge sessions #318

Summary

Where it happens (call chain)

Contributing factors

Impact

Proposed fixes

Repro

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

prompt grader false-fails / wedges on 'waiting for session.idle: context deadline exceeded' for long multi-turn judge sessions #318

Description

Summary

Where it happens (call chain)

Contributing factors

Impact

Proposed fixes

Repro

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions