
Gateway process not cleaned up on unclean shutdown after provider timeout — bash/sleep children SIGKILL'd by systemd #8202

@philmossman

Describe the bug

When the Anthropic provider stops responding, the gateway logs a 180s timeout and attempts to reconnect. In the observed incident, tool calls (bash, sleep) continued to execute after the reconnect log message was emitted. When systemd eventually sent SIGTERM to the gateway process, the gateway did not clean up its child processes within systemd's stop timeout, so systemd sent SIGKILL to the remaining bash and sleep subprocesses.

Note: It is unclear whether the bash/sleep processes were spawned by the reconnect logic itself or were pre-existing subprocesses from active tool calls (terminal commands) running at the time of shutdown. This distinction matters for root cause — it may be a child process cleanup gap on shutdown rather than a reconnect-specific leak.

This has happened twice: once the previous evening (gateway died silently and was not discovered until the following morning), and once during an active session after the reconnect attempt.

Related: #4057 (missing timeouts on subprocess/network calls). Note also that #7503 (drain in-flight work before restart) was merged and addresses graceful shutdown on /restart, but the systemd SIGTERM path appears to still have this gap.

Steps to Reproduce

  1. Run hermes-gateway.service under systemd as a user service
  2. Have an active session with in-flight tool calls (e.g. bash/shell tool running)
  3. Anthropic API becomes unresponsive for 180+ seconds during that session
  4. Gateway logs: No response from provider for 180s (model: claude-sonnet-4-6, context: ~614 tokens). Reconnecting...
  5. Tool calls continue to execute after the above log line (it is unclear whether they were newly dispatched or already in flight)
  6. Systemd issues SIGTERM to the gateway; gateway fails to exit cleanly within TimeoutStopSec (value not recorded — see notes)
  7. Systemd escalates to SIGKILL on remaining child processes

Note: TimeoutStopSec for the hermes-gateway.service unit was not recorded during this incident. If a non-default value is relevant to reproduction, happy to check.
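To record the missing values for a future occurrence, the unit's stop-related settings can be read back from the user service manager. The following is a minimal sketch, not part of Hermes: the helper names `parse_show_output` and `unit_stop_settings` are made up for illustration, and it assumes `systemctl --user show` is available on the host. `TimeoutStopUSec`, `KillMode`, and `Restart` are the property names `systemctl show` reports for the corresponding unit settings.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def unit_stop_settings(unit: str = "hermes-gateway.service") -> dict[str, str]:
    """Ask the user service manager for the stop, kill, and restart settings of a unit."""
    result = subprocess.run(
        ["systemctl", "--user", "show", unit,
         "--property=TimeoutStopUSec,KillMode,Restart"],
        check=True, capture_output=True, text=True,
    )
    return parse_show_output(result.stdout)


# Example (requires a running systemd user session):
#   for key, value in unit_stop_settings().items():
#       print(f"{key}={value}")
```

If `TimeoutStopSec` was never set in the unit file, the reported value is whatever the manager's default is (90 seconds on a stock systemd configuration).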

Expected Behavior

On shutdown (SIGTERM), the gateway should either wait for in-flight tool calls to complete or forcibly terminate its child subprocesses before exiting, so that the main process exits cleanly within TimeoutStopSec and systemd has nothing left to SIGKILL.

Actual Behavior

From journalctl (05:37:48 UTC, gateway PID 28441):

hermes-gateway.service: Killing process 29771 (bash) with signal SIGKILL.
hermes-gateway.service: Killing process 29773 (sleep) with signal SIGKILL.
hermes-gateway.service: Consumed 27.724s CPU time.

The gateway did not restart automatically, which is consistent with Restart= not being configured in the unit file. A manual restart at 05:44:04 UTC started a new instance (PID 29849).

Environment

  • OS: Ubuntu
  • Service type: systemd user service (not system-level)
  • Trigger: Anthropic provider 180s timeout during active session with running tool calls
  • TimeoutStopSec: unknown (not recorded during incident)
  • Hermes-agent version: latest main
  • Restart= in unit file: not configured (required manual restart)

Labels

  • P1: High, major feature broken, no workaround
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working
