Symptom
When the gateway restarts (via hermes gateway restart or launchd KeepAlive), any delegate_task subagents that are in-flight are not cancelled. They become orphaned, lose their API connections, and eventually time out after 12–16 minutes of no response. The user sees:
- Agent idle timeout (781s for main agent, 757s/765s for subagents)
- Subagent interrupted mid-API-call errors
- Lost work from the delegate tasks
Evidence
From gateway.log during a restart event (gateway received restart command at 06:56, old process lingered until 07:15):
No response from provider for 968s (model: deepseek-v4-pro, context: ~199,242 tokens)
No response from provider for 1097s (model: deepseek-v4-pro, context: ~199,242 tokens)
[subagent-0] No response from provider for 757s (model: kimi-k2.6, context: ~3,080 tokens)
[subagent-1] No response from provider for 765s (model: kimi-k2.6, context: ~5,538 tokens)
[subagent-1] Interrupted during API call.
[subagent-0] Interrupted during API call.
All 56 timeout/disconnect/interrupt events in the gateway log across multiple sessions occurred during gateway restart transition periods.
Root Cause
cancel_session_processing in the gateway shutdown path does not recursively cancel in-flight delegate subagents. When the gateway is shutting down:
- The main agent session gets a shutdown signal
- But delegate_task subagents continue running independently
- The gateway process loses its API connections during restart
- Subagents wait indefinitely for responses (757s, 765s observed)
- Eventually they hit connection-level timeouts and are interrupted
Proposed Fix
In the gateway shutdown sequence, before disconnecting platform adapters:
- Collect all active sessions with in-flight delegate tasks
- Cancel each subagent (set interrupt flag, cancel pending API calls)
- Wait for subagents to acknowledge cancellation (with a short grace period, e.g. 5s)
- Only then proceed with transport disconnect and gateway shutdown
Related
Symptom
When the gateway restarts (via
hermes gateway restartor launchd KeepAlive), any delegate_task subagents that are in-flight are not cancelled. They become orphaned, lose their API connections, and eventually time out after 12–16 minutes of no response. The user sees:Evidence
From
gateway.logduring a restart event (gateway received restart command at 06:56, old process lingered until 07:15):All 56 timeout/disconnect/interrupt events in the gateway log across multiple sessions occurred during gateway restart transition periods.
Root Cause
cancel_session_processingin the gateway shutdown path does not recursively cancel in-flight delegate subagents. When the gateway is shutting down:Proposed Fix
In the gateway shutdown sequence, before disconnecting platform adapters:
Related