Skip to content

fix(execution): add a first-event watchdog to catch session-start hangs#321

Merged
github-actions[bot] merged 2 commits into
microsoft:mainfrom
sebastienlevert:fix/first-event-timeout
Jun 15, 2026
Merged

fix(execution): add a first-event watchdog to catch session-start hangs#321
github-actions[bot] merged 2 commits into
microsoft:mainfrom
sebastienlevert:fix/first-event-timeout

Conversation

@sebastienlevert

Copy link
Copy Markdown
Contributor

Summary

A run whose embedded engine launches but never produces a first session event
is currently only caught by the overall per-task turn timeout (timeout_seconds).
Because that timeout must be sized for the slowest legitimate full turn, a
session-start hang can block for many minutes (or hours) and surfaces as the
generic session.idle: context deadline exceeded. This adds an opt-in
time-to-first-event watchdog (first_event_timeout_seconds) that aborts such
a wedge in seconds with a distinct, actionable error, while never false-aborting a
slow-but-live first turn.

Related issue

Closes #320

Agent handoff

  • Scope: internal/execution (engine + collector), internal/models (config),
    internal/orchestration (wiring), plus JSON schema + docs. No behavior change
    unless first_event_timeout_seconds is set (default 0 = disabled).
  • Key files changed:
    • internal/execution/session_events_collector.goFirstEvent() channel,
      closed on the first event of any type.
    • internal/execution/copilot.go — arms a first-event timer before
      SendAndWait, cancels a child context with a distinct cause
      (errFirstEventTimeout) on timeout, disarms on the first event, and maps the
      cause to a distinct session start timeout: no first turn within … error.
    • internal/execution/engine.goExecutionRequest.FirstEventTimeout.
    • internal/models/spec.go / testcase.gofirst_event_timeout_seconds
      (eval Config + per-task *int override) and validation (>= 0; 0 disables).
    • internal/orchestration/runner.gofirstEventTimeout(tc) resolver, set on
      every request in buildExecutionRequest (covers initial + follow-up turns).
    • schemas/eval.schema.json, schemas/task.schema.json — new field (required
      because config uses additionalProperties: false).
    • site/.../reference/schema.mdx, site/.../guides/eval-yaml.mdx — docs.
  • Important decisions:
    • Opt-in, default 0 (disabled) so existing specs are byte-for-byte
      unchanged; downstream long-turn suites enable it explicitly.
    • Fire on the first event of ANY type — a true session-start hang emits no
      events at all, so any event proves the engine is live; this errs toward never
      false-aborting a slow first turn.
    • context.WithCancelCause + context.Cause is used (race-free) to
      distinguish a first-event abort from the overall timeout_seconds deadline and
      from CancelOnSkillInvocation.
  • Follow-ups / known gaps: the underlying stall likely also warrants a
    bootstrap/first-request deadline in github/copilot-sdk (surface a
    session.error instead of silence); waza defends regardless. Choosing a
    sensible non-zero default could be a later change once it's been exercised in CI.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor or maintenance
  • CI/CD or release change

Validation

  • go test ./... — full suite green locally (incl. the new tests below).
  • make lint or golangci-lint rungolangci-lint (v2) reports zero
    issues on the changed files
    (gofmt-clean, govet enable-all,
    errcheck, staticcheck).
  • Docs site checked, if docs changed — added the field to the schema
    reference and the eval-YAML config table.
  • Web/dashboard checks run, if web/ changed — n/a (no web/ change).
  • Manual validation completed: new tests assert (a) the watchdog aborts a
    no-event session near its tiny budget rather than the large overall deadline,
    (b) it disarms when the first event arrives (no false abort), (c) the collector
    signal closes on the first event and is idempotent, (d) the runner resolves
    spec/task/disable, and (e) config validation rejects negatives / allows 0.

New tests:

  • internal/execution/copilot_first_event_test.go
    TestSessionEventsCollector_FirstEvent_ClosesOnFirstEvent,
    TestCopilotExecute_FirstEventTimeout_AbortsSessionStartHang,
    TestCopilotExecute_FirstEventTimeout_DisarmsOnFirstEvent.
  • internal/orchestration/runner_test.goTestFirstEventTimeout_Resolution.
  • internal/models/{testcase,spec}_test.go — negative-rejected / zero-allowed.

Documentation

  • README updated, if user-facing behavior changed — n/a (no README surface).
  • site/ docs updated — schema reference + eval-YAML config table.
  • Examples updated, if relevant — n/a.
  • Not applicable

Risk and rollback

  • Risk level: Low. The feature is fully opt-in: with first_event_timeout_seconds
    unset/0, Execute takes exactly the prior path (sendCtx == ctx, no timer,
    no goroutine) and all existing tests pass unchanged.
  • Rollback plan: revert the commit; no migrations, no persisted state, no
    schema-breaking change (the new property is additive and optional).

Notes for reviewers

  • The core logic is the watchdog block in copilot.go around SendAndWait.
    Please scrutinize the cancellation precedence: CancelOnSkillInvocation
    (checked first) → errFirstEventTimeout (distinct cause) → the overall
    context.DeadlineExceeded. context.Cause(sendCtx) is what disambiguates them.
  • The first-event signal lives at the top of SessionEventsCollector.On (guarded
    by a sync.Once), so it can't double-close and disarms on the very first event.
  • Local lint shows 3 pre-existing gofmt hits on untouched files
    (copilot_client_wrappers.go, generate.go, git.go) — a Windows CRLF
    working-copy artifact, not present in the LF repo blobs and not in this diff.

A run whose embedded engine launches but never produces a first session
event is currently only caught by the overall per-task turn timeout
(timeout_seconds). Because that timeout must be sized for the slowest
legitimate full turn, a no-first-turn wedge can block for many minutes (or
hours) and surfaces as the generic "session.idle: context deadline exceeded",
indistinguishable from a genuinely long-running turn.

Add a time-to-first-event guard distinct from timeout_seconds:

- SessionEventsCollector exposes FirstEvent(), a channel closed on the first
  event of any type.
- CopilotEngine.Execute arms a short timer before SendAndWait that cancels a
  child context (with a distinct cause) when no event arrives in time, and
  disarms it the instant the first event lands. The cause maps to a distinct,
  actionable "session start timeout" error so a session-start hang is not
  conflated with a normal long turn that timed out.
- New opt-in config first_event_timeout_seconds (eval Config + per-task
  override), default 0 (disabled), resolved in buildExecutionRequest.

It fires only on the genuine no-events signature and errs toward never
false-aborting a slow-but-live first turn (any event disarms it). Includes
unit tests, JSON schema entries, and docs.

Refs microsoft#320

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sebastienlevert sebastienlevert requested a review from spboyer as a code owner June 14, 2026 19:26
Copilot AI review requested due to automatic review settings June 14, 2026 19:26
@github-actions github-actions Bot enabled auto-merge (squash) June 14, 2026 19:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a configurable “time to first session event” watchdog to abort session-start hangs (engine launches but produces no events) more quickly than the overall per-run timeout.

Changes:

  • Introduces first_event_timeout_seconds at eval and task scope (schemas + docs) with 0 meaning disabled.
  • Threads the resolved timeout through EvalRunner into ExecutionRequest, and implements the watchdog in the Copilot engine.
  • Adds unit tests covering config resolution/validation and Copilot first-event watchdog behavior.
Show a summary per file
File Description
site/src/content/docs/reference/schema.mdx Documents the new config field and recommended sizing relative to timeout_seconds.
site/src/content/docs/guides/eval-yaml.mdx Adds the new config field to the eval YAML configuration table.
schemas/task.schema.json Adds task-level first_event_timeout_seconds override to the JSON schema.
schemas/eval.schema.json Adds eval-level first_event_timeout_seconds to the JSON schema.
internal/orchestration/runner_test.go Tests spec/task resolution rules for the first-event timeout.
internal/orchestration/runner.go Passes resolved first-event timeout into the execution request.
internal/models/testcase_test.go Tests testcase YAML validation for negative/zero first-event timeout values.
internal/models/testcase.go Adds testcase field + validation for first_event_timeout_seconds.
internal/models/spec_test.go Tests eval spec validation for negative/zero/positive first-event timeout values.
internal/models/spec.go Adds eval config field + validation for first_event_timeout_seconds.
internal/execution/session_events_collector.go Adds a first-event signal channel closed on the first received session event.
internal/execution/engine.go Extends ExecutionRequest with FirstEventTimeout.
internal/execution/copilot_first_event_test.go Adds tests verifying watchdog abort/disarm behavior and collector signaling.
internal/execution/copilot.go Implements the watchdog via a cancelable context and a timer.

Copilot's findings

  • Files reviewed: 14/14 changed files
  • Comments generated: 3

Comment thread internal/execution/copilot.go
Comment thread schemas/eval.schema.json
Comment thread internal/execution/copilot_first_event_test.go Outdated
@codecov-commenter

codecov-commenter commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.00000% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@9b0b076). Learn more about missing BASE report.

Files with missing lines Patch % Lines
internal/models/testcase.go 57.14% 2 Missing and 1 partial ⚠️
internal/execution/copilot.go 93.10% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #321   +/-   ##
=======================================
  Coverage        ?   75.45%           
=======================================
  Files           ?      160           
  Lines           ?    18784           
  Branches        ?        0           
=======================================
  Hits            ?    14174           
  Misses          ?     3606           
  Partials        ?     1004           
Flag Coverage Δ
go-implementation 75.45% <90.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot merged commit 42a1bb3 into microsoft:main Jun 15, 2026
7 checks passed
@spboyer spboyer mentioned this pull request Jun 15, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session-start hang: a run that produces no first event blocks until the full turn timeout (no time-to-first-event guard)

4 participants