fix(execution): add a first-event watchdog to catch session-start hangs#321
Merged
github-actions[bot] merged 2 commits intoJun 15, 2026
Merged
Conversation
A run whose embedded engine launches but never produces a first session event is currently only caught by the overall per-task turn timeout (timeout_seconds). Because that timeout must be sized for the slowest legitimate full turn, a no-first-turn wedge can block for many minutes (or hours) and surfaces as the generic "session.idle: context deadline exceeded", indistinguishable from a genuinely long-running turn. Add a time-to-first-event guard distinct from timeout_seconds: - SessionEventsCollector exposes FirstEvent(), a channel closed on the first event of any type. - CopilotEngine.Execute arms a short timer before SendAndWait that cancels a child context (with a distinct cause) when no event arrives in time, and disarms it the instant the first event lands. The cause maps to a distinct, actionable "session start timeout" error so a session-start hang is not conflated with a normal long turn that timed out. - New opt-in config first_event_timeout_seconds (eval Config + per-task override), default 0 (disabled), resolved in buildExecutionRequest. It fires only on the genuine no-events signature and errs toward never false-aborting a slow-but-live first turn (any event disarms it). Includes unit tests, JSON schema entries, and docs. Refs microsoft#320 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a configurable “time to first session event” watchdog to abort session-start hangs (engine launches but produces no events) more quickly than the overall per-run timeout.
Changes:
- Introduces
first_event_timeout_secondsat eval and task scope (schemas + docs) with0meaning disabled. - Threads the resolved timeout through
EvalRunnerintoExecutionRequest, and implements the watchdog in the Copilot engine. - Adds unit tests covering config resolution/validation and Copilot first-event watchdog behavior.
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/schema.mdx | Documents the new config field and recommended sizing relative to timeout_seconds. |
| site/src/content/docs/guides/eval-yaml.mdx | Adds the new config field to the eval YAML configuration table. |
| schemas/task.schema.json | Adds task-level first_event_timeout_seconds override to the JSON schema. |
| schemas/eval.schema.json | Adds eval-level first_event_timeout_seconds to the JSON schema. |
| internal/orchestration/runner_test.go | Tests spec/task resolution rules for the first-event timeout. |
| internal/orchestration/runner.go | Passes resolved first-event timeout into the execution request. |
| internal/models/testcase_test.go | Tests testcase YAML validation for negative/zero first-event timeout values. |
| internal/models/testcase.go | Adds testcase field + validation for first_event_timeout_seconds. |
| internal/models/spec_test.go | Tests eval spec validation for negative/zero/positive first-event timeout values. |
| internal/models/spec.go | Adds eval config field + validation for first_event_timeout_seconds. |
| internal/execution/session_events_collector.go | Adds a first-event signal channel closed on the first received session event. |
| internal/execution/engine.go | Extends ExecutionRequest with FirstEventTimeout. |
| internal/execution/copilot_first_event_test.go | Adds tests verifying watchdog abort/disarm behavior and collector signaling. |
| internal/execution/copilot.go | Implements the watchdog via a cancelable context and a timer. |
Copilot's findings
- Files reviewed: 14/14 changed files
- Comments generated: 3
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #321 +/- ##
=======================================
Coverage ? 75.45%
=======================================
Files ? 160
Lines ? 18784
Branches ? 0
=======================================
Hits ? 14174
Misses ? 3606
Partials ? 1004
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A run whose embedded engine launches but never produces a first session event
is currently only caught by the overall per-task turn timeout (
timeout_seconds).Because that timeout must be sized for the slowest legitimate full turn, a
session-start hang can block for many minutes (or hours) and surfaces as the
generic
session.idle: context deadline exceeded. This adds an opt-intime-to-first-event watchdog (
first_event_timeout_seconds) that aborts sucha wedge in seconds with a distinct, actionable error, while never false-aborting a
slow-but-live first turn.
Related issue
Closes #320
Agent handoff
internal/execution(engine + collector),internal/models(config),internal/orchestration(wiring), plus JSON schema + docs. No behavior changeunless
first_event_timeout_secondsis set (default0= disabled).internal/execution/session_events_collector.go—FirstEvent()channel,closed on the first event of any type.
internal/execution/copilot.go— arms a first-event timer beforeSendAndWait, cancels a child context with a distinct cause(
errFirstEventTimeout) on timeout, disarms on the first event, and maps thecause to a distinct
session start timeout: no first turn within …error.internal/execution/engine.go—ExecutionRequest.FirstEventTimeout.internal/models/spec.go/testcase.go—first_event_timeout_seconds(eval
Config+ per-task*intoverride) and validation (>= 0;0disables).internal/orchestration/runner.go—firstEventTimeout(tc)resolver, set onevery request in
buildExecutionRequest(covers initial + follow-up turns).schemas/eval.schema.json,schemas/task.schema.json— new field (requiredbecause
configusesadditionalProperties: false).site/.../reference/schema.mdx,site/.../guides/eval-yaml.mdx— docs.0(disabled) so existing specs are byte-for-byteunchanged; downstream long-turn suites enable it explicitly.
events at all, so any event proves the engine is live; this errs toward never
false-aborting a slow first turn.
context.WithCancelCause+context.Causeis used (race-free) todistinguish a first-event abort from the overall
timeout_secondsdeadline andfrom
CancelOnSkillInvocation.bootstrap/first-request deadline in
github/copilot-sdk(surface asession.errorinstead of silence); waza defends regardless. Choosing asensible non-zero default could be a later change once it's been exercised in CI.
Type of change
Validation
go test ./...— full suite green locally (incl. the new tests below).make lintorgolangci-lint run—golangci-lint(v2) reports zeroissues on the changed files (
gofmt-clean,govet enable-all,errcheck,staticcheck).reference and the eval-YAML config table.
web/changed — n/a (noweb/change).no-event session near its tiny budget rather than the large overall deadline,
(b) it disarms when the first event arrives (no false abort), (c) the collector
signal closes on the first event and is idempotent, (d) the runner resolves
spec/task/disable, and (e) config validation rejects negatives / allows
0.New tests:
internal/execution/copilot_first_event_test.go—TestSessionEventsCollector_FirstEvent_ClosesOnFirstEvent,TestCopilotExecute_FirstEventTimeout_AbortsSessionStartHang,TestCopilotExecute_FirstEventTimeout_DisarmsOnFirstEvent.internal/orchestration/runner_test.go—TestFirstEventTimeout_Resolution.internal/models/{testcase,spec}_test.go— negative-rejected / zero-allowed.Documentation
site/docs updated — schema reference + eval-YAML config table.Risk and rollback
first_event_timeout_secondsunset/
0,Executetakes exactly the prior path (sendCtx == ctx, no timer,no goroutine) and all existing tests pass unchanged.
schema-breaking change (the new property is additive and optional).
Notes for reviewers
copilot.goaroundSendAndWait.Please scrutinize the cancellation precedence:
CancelOnSkillInvocation(checked first) →
errFirstEventTimeout(distinct cause) → the overallcontext.DeadlineExceeded.context.Cause(sendCtx)is what disambiguates them.SessionEventsCollector.On(guardedby a
sync.Once), so it can't double-close and disarms on the very first event.gofmthits on untouched files(
copilot_client_wrappers.go,generate.go,git.go) — a Windows CRLFworking-copy artifact, not present in the LF repo blobs and not in this diff.