[Bug]: Gateway restart orphans in-flight delegate subagents — 12+ min timeout #26315

@wjameswen888

Symptom

When the gateway restarts (via hermes gateway restart or launchd KeepAlive), in-flight delegate_task subagents are not cancelled. They are orphaned, lose their API connections, and only time out after roughly 12 to 18 minutes without a response (757s to 1097s in the log excerpt under Evidence). The user sees:

  • Agent idle timeout (781s for main agent, 757s/765s for subagents)
  • Subagent interrupted mid-API-call errors
  • Lost work from the delegate tasks

Evidence

From gateway.log during a restart event (gateway received restart command at 06:56, old process lingered until 07:15):

No response from provider for 968s (model: deepseek-v4-pro, context: ~199,242 tokens)
No response from provider for 1097s (model: deepseek-v4-pro, context: ~199,242 tokens)
[subagent-0] No response from provider for 757s (model: kimi-k2.6, context: ~3,080 tokens)
[subagent-1] No response from provider for 765s (model: kimi-k2.6, context: ~5,538 tokens)
[subagent-1] Interrupted during API call.
[subagent-0] Interrupted during API call.

All 56 timeout/disconnect/interrupt events in the gateway log across multiple sessions occurred during gateway restart transition periods.

Root Cause

cancel_session_processing in the gateway shutdown path does not recursively cancel in-flight delegate subagents. When the gateway is shutting down:

  1. The main agent session receives a shutdown signal
  2. The delegate_task subagents it spawned are not signalled and keep running independently
  3. The gateway process loses its API connections during the restart
  4. The subagents keep waiting on responses that never arrive (757s and 765s observed)
  5. Eventually they hit connection-level timeouts and are interrupted
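
The failure mode in steps 1 and 2 can be reproduced in isolation. The following is a minimal asyncio sketch, not the gateway's actual code: it assumes subagents run as independently created tasks, and the names subagent, main_agent and demo are hypothetical. It shows that cancelling a parent does not cancel children the parent is not directly awaiting.

```python
import asyncio


async def subagent(name: str) -> str:
    # Stand-in for a delegate_task subagent blocked on a provider call.
    await asyncio.sleep(3600)
    return name


async def main_agent(children: list) -> None:
    # The parent spawns subagents as independent tasks and then waits on
    # something else, so it holds no direct await on the children.
    children.append(asyncio.create_task(subagent("subagent-0")))
    children.append(asyncio.create_task(subagent("subagent-1")))
    await asyncio.Event().wait()  # parked until cancelled


async def demo() -> list:
    children: list = []
    parent = asyncio.create_task(main_agent(children))
    await asyncio.sleep(0.01)  # let the parent spawn its subagents
    parent.cancel()            # non-recursive shutdown signal
    await asyncio.gather(parent, return_exceptions=True)
    # True means the child is still running after the parent has stopped.
    orphaned = [not child.done() for child in children]
    for child in children:     # clean up so the demo exits promptly
        child.cancel()
    await asyncio.gather(*children, return_exceptions=True)
    return orphaned


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [True, True]
```

Both children are still pending after the parent has been cancelled, which matches the orphaning described above under the stated assumption about how subagents are scheduled.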

Proposed Fix

In the gateway shutdown sequence, before disconnecting platform adapters:

  1. Collect all active sessions with in-flight delegate tasks
  2. Cancel each subagent (set interrupt flag, cancel pending API calls)
  3. Wait for subagents to acknowledge cancellation (with a short grace period, e.g. 5s)
  4. Only then proceed with transport disconnect and gateway shutdown
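
The four steps above could be sketched roughly as follows. This is an illustration under assumptions, not the gateway's real API: Subagent, Session, cancel_delegates, shutdown and the disconnect callback are hypothetical names, and it assumes each subagent is reachable as an asyncio task.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List


@dataclass
class Subagent:
    """Hypothetical handle for one in-flight delegate_task subagent."""
    name: str
    task: asyncio.Task
    interrupt_requested: bool = False


@dataclass
class Session:
    """Hypothetical session record; the real gateway structure may differ."""
    session_id: str
    subagents: List[Subagent] = field(default_factory=list)


async def cancel_delegates(sessions: List[Session], grace: float = 5.0) -> List[str]:
    """Cancel in-flight subagents and return the names that did not stop in time."""
    # Step 1: collect every subagent that is still running.
    active = [sa for s in sessions for sa in s.subagents if not sa.task.done()]
    if not active:
        return []
    # Step 2: set the interrupt flag and cancel the pending API call.
    for sa in active:
        sa.interrupt_requested = True
        sa.task.cancel()
    # Step 3: bounded wait for the subagents to acknowledge cancellation.
    _done, pending = await asyncio.wait([sa.task for sa in active], timeout=grace)
    return [sa.name for sa in active if sa.task in pending]


async def shutdown(
    sessions: List[Session],
    disconnect: Callable[[], Awaitable[None]],
    grace: float = 5.0,
) -> List[str]:
    stragglers = await cancel_delegates(sessions, grace)
    # Step 4: only now tear down transports and continue gateway shutdown.
    await disconnect()
    return stragglers
```

Returning the stragglers rather than raising lets the caller log which subagents ignored the cancel without blocking the restart for longer than the grace period.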

Labels

  • P2: Medium, degraded but workaround exists
  • comp/gateway: Gateway runner, session dispatch, delivery
  • sweeper:implemented-on-main: Sweeper: behavior already present on current main
  • tool/delegate: Subagent delegation
  • type/bug: Something isn't working
