Skip to content

Session-start hang: a run that produces no first event blocks until the full turn timeout (no time-to-first-event guard) #320

Description

@sebastienlevert

Summary

A run whose embedded engine launches but never produces a first session event
(no assistant message, no tool call, nothing) is only caught by the overall
per-task turn timeout (timeout_seconds). Because that timeout must be sized for
the slowest legitimate full turn, a fast "session-start hang" can block for many
minutes — or hours — before it is reaped, and it surfaces as the generic
waiting for session.idle: context deadline exceeded, indistinguishable from a
genuinely long-running turn that timed out.

There is no separate time-to-first-event guard.

Where it comes from (code references, main)

The whole turn is one blocking call governed by a single deadline:

  • internal/orchestration/runner.goexecuteRun wraps the entire Execute
    call in context.WithTimeout(ctx, timeout) where timeout comes from
    executionTimeout(tc), i.e. only Config.TimeoutSec / TestCase.TimeoutSec.
    That single deadline must cover the slowest legitimate full turn.
  • internal/execution/copilot.goExecute calls
    session.SendAndWait(ctx, …), which blocks until the session reaches a
    terminal state or ctx expires.
  • internal/execution/session_events_collector.go — the collector's done
    channel is closed only on a terminal SessionEventTypeSessionIdle /
    SessionEventTypeSessionError. A session that emits no events at all never
    signals done, so SendAndWait blocks for the full ctx deadline.

internal/execution/engine.go even documents that the deadline is entirely the
caller's: "The caller controls cancellation and deadlines through ctx." There is
no internal first-event guard anywhere (a repo search for a "first event" /
"time to first" timeout finds only an unrelated PRD metric).

Why one coarse timeout can't catch it

timeout_seconds is necessarily large for long multi-turn / lifecycle evals
(e.g. a downstream consumer runs golden suites at timeout_seconds: 14400 = 4 h).
A no-first-turn wedge therefore burns up to that entire budget before being
reaped — turning a fast failure into a multi-minute/hour stall and losing the
data point — even though a healthy session emits its first event within seconds.

Observed symptom (downstream)

A copilot-sdk run launched the embedded engine, logged
level=INFO msg="using Copilot CLI" … copilot_1.0.49.exe, then emitted nothing
further
— no transcript, no grades — until an external watchdog killed it ~38
min in. The captured output was only the process startup banner (~251 bytes);
ExecutionResponse.Events was empty.

Proposed fix

Add a time-to-first-event deadline distinct from timeout_seconds:

  • Signal the first event of any type from SessionEventsCollector (a FirstEvent()
    channel closed on the first On(...) call).
  • In copilot.go Execute, arm a short timer before SendAndWait that cancels a
    child context (with a distinct cause) if no event arrives in time, and disarm it
    the instant the first event lands. Map that cancellation cause to a distinct,
    actionable error (e.g. session start timeout: no first turn within Ns) so a
    session-start hang is not conflated with a normal long turn that timed out.
  • New opt-in config first_event_timeout_seconds (eval Config + per-task
    override), defaulting to 0 (disabled) so existing specs are unchanged.

This fires on the genuine signature (no events at all) and errs toward never
false-aborting a slow-but-live first turn (any event disarms it). Keeping the
window comfortably above realistic first-turn latency avoids false positives.

A deeper fix could also live in github/copilot-sdk (surface a session.error
when session bootstrap / the first model request stalls), but waza should defend
regardless — an embedded engine should never be able to wedge the runner for the
full (large) turn budget.

I'll open a PR implementing the above.

Environment

Windows, embedded Copilot SDK copilot_1.0.49, copilot-sdk executor, agent on a
long multi-turn (golden/lifecycle) suite with a large timeout_seconds.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions