Describe the bug
When the Anthropic provider stops responding, the gateway logs a 180s timeout and attempts to reconnect. In the observed incident, tool calls (bash, sleep) continued to execute after the reconnect log message was emitted. When systemd eventually sent SIGTERM to the gateway process, the gateway did not clean up its child processes within systemd's stop timeout, so systemd sent SIGKILL to the remaining bash and sleep subprocesses.
Note: It is unclear whether the bash/sleep processes were spawned by the reconnect logic itself or were pre-existing subprocesses from active tool calls (terminal commands) running at the time of shutdown. This distinction matters for root cause — it may be a child process cleanup gap on shutdown rather than a reconnect-specific leak.
This has happened twice: once the previous evening (gateway died silently and was not discovered until the following morning), and once during an active session after the reconnect attempt.
Related: #4057 (missing timeouts on subprocess/network calls). Note also that #7503 (drain in-flight work before restart) was merged and addresses graceful shutdown on /restart, but the systemd SIGTERM path appears to still have this gap.
Steps to Reproduce
- Run hermes-gateway.service under systemd as a user service
- Have an active session with in-flight tool calls (e.g. bash/shell tool running)
- Anthropic API becomes unresponsive for 180+ seconds during that session
- Gateway logs:
No response from provider for 180s (model: claude-sonnet-4-6, context: ~614 tokens). Reconnecting...
- Tool calls continue to execute after the above log line (it is unclear whether new calls are dispatched or only pre-existing ones keep running)
- Systemd issues SIGTERM to the gateway; gateway fails to exit cleanly within TimeoutStopSec (value not recorded; see notes)
- Systemd escalates to SIGKILL on remaining child processes
Note: TimeoutStopSec for the hermes-gateway.service unit was not recorded during this incident. If a non-default value is relevant to reproduction, happy to check.
Expected Behavior
On shutdown (SIGTERM), the gateway waits for in-flight tool calls to complete or forcibly terminates child subprocesses before exiting, so systemd receives a clean exit within TimeoutStopSec.
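A minimal sketch of the "forcibly terminates child subprocesses" half of this, assuming tool commands are launched through Python's subprocess module. The names spawn_tool, terminate_children and install_sigterm_handler are hypothetical and are not taken from the Hermes codebase; the 5 second grace period is also an assumption and would need to stay below TimeoutStopSec.

```python
import os
import signal
import subprocess
import time


def spawn_tool(argv):
    """Hypothetical helper: start a tool command in its own session.

    start_new_session=True makes the child a process group leader, so
    bash and anything it spawns (e.g. sleep) share one group id.
    """
    return subprocess.Popen(argv, start_new_session=True)


def terminate_children(children, grace_seconds=5.0):
    """SIGTERM each live child's process group, then SIGKILL stragglers."""
    for child in children:
        if child.poll() is None:  # skip children that were already reaped
            try:
                os.killpg(child.pid, signal.SIGTERM)
            except (ProcessLookupError, PermissionError):
                pass  # group vanished between poll() and killpg()
    deadline = time.monotonic() + grace_seconds
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            try:
                os.killpg(child.pid, signal.SIGKILL)
            except (ProcessLookupError, PermissionError):
                pass
            child.wait()


def install_sigterm_handler(children, grace_seconds=5.0):
    """Hypothetical wiring: clean up children, then exit with status 0."""
    def handle(signum, frame):
        terminate_children(children, grace_seconds)
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, handle)
```

If the gateway runs on asyncio, the same cleanup would more likely be registered with loop.add_signal_handler than with signal.signal; that choice depends on how the gateway's main loop is structured, which this report does not establish.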
Actual Behavior
From journalctl (05:37:48 UTC, gateway PID 28441):
hermes-gateway.service: Killing process 29771 (bash) with signal SIGKILL.
hermes-gateway.service: Killing process 29773 (sleep) with signal SIGKILL.
hermes-gateway.service: Consumed 27.724s CPU time.
Gateway did not restart automatically. Manual restart at 05:44:04 UTC started new instance (PID 29849).
Environment
- OS: Ubuntu
- Service type: systemd user service (not system-level)
- Trigger: Anthropic provider 180s timeout during active session with running tool calls
- TimeoutStopSec: unknown (not recorded during incident)
- Hermes-agent version: latest main
- Restart= in unit file: not configured (required manual restart)