
hermes gateway restart: CLI deadline equals gateway drain budget, false-fires 'still running after 60s' warning and force-kills mid-cleanup #25966

@chrisworksai

Context

On macOS (launchd), hermes gateway restart prints "⚠ Gateway PID N still running after 60.0s — restart may fail" even when the restart succeeds, then runs launchctl kickstart -k (SIGKILL) against a gateway that was already shutting down cleanly. The force-kill races the gateway's cleanup tail, so sessions are marked as auto-resumable even though the original shutdown was valid.

Related to #17198 (same file, sibling call site, addressed by PR #17292, which is still open). Filing this as an issue rather than a competing PR, to defer to maintainer preference on whether to fold it into #17292 or accept a follow-up.

Observation

hermes_cli/gateway.py launchd_restart() around L3008:

exited = _wait_for_gateway_exit(timeout=drain_timeout, force_after=None)
if not exited:
    print(f"⚠ Gateway drain timed out after {drain_timeout:.0f}s — forcing launchd restart")
subprocess.run(["launchctl", "kickstart", "-k", target], ...)

CLI deadline = drain_timeout (default 60s, from HERMES_RESTART_DRAIN_TIMEOUT).
Gateway-side drain budget = also 60s, plus ~1.2s for adapter-disconnect / SessionDB close / atexit / final exit.

The two deadlines collide. Any drain that runs close to its full budget loses the race even on a clean shutdown.
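
The collision comes down to simple arithmetic, sketched below. This is illustrative only: gateway_exits_in_time is a hypothetical helper, not a function in hermes_cli, and the 1.2s tail is the approximate figure quoted above.

```python
def gateway_exits_in_time(drain_used_s: float, cleanup_tail_s: float,
                          cli_deadline_s: float) -> bool:
    """True if drain plus cleanup tail finishes before the CLI stops waiting."""
    return drain_used_s + cleanup_tail_s <= cli_deadline_s


# A drain that uses 59s of its 60s budget, plus the ~1.2s tail,
# finishes after a CLI deadline of 60s.
print(gateway_exits_in_time(59.0, 1.2, cli_deadline_s=60.0))

# The same shutdown fits if the CLI waits 75s (an assumed value).
print(gateway_exits_in_time(59.0, 1.2, cli_deadline_s=75.0))
```

Any CLI deadline that does not include headroom for the tail loses to a drain that runs near its budget.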

Evidence from a live log (~/.hermes/logs/gateway.log)

2026-05-14 16:19:59  Received SIGTERM — initiating shutdown
2026-05-14 16:20:49  Shutdown phase: drain done at +50.09s
                     (drain took 49.36s, timed_out=False, active_at_start=1, active_now=0)

The clean drain finished at +50.09s, about 10s inside the CLI's 60s deadline. Three back-to-back hermes gateway restart invocations on the same machine all printed the false-alarm warning, even though two of them logged timed_out=False.

When the CLI does cross 60s, it falls through to launchctl kickstart -k mid-cleanup → no .clean_shutdown marker → next boot:

INFO gateway.run: Marked 1 in-flight session(s) as resumable from previous run
INFO gateway.run: Scheduled auto-resume for 1 restart-interrupted session(s)

Sessions get auto-resume-tagged on what was actually a graceful shutdown.

Analysis

The CLI deadline must exceed the gateway deadline by enough to cover the cleanup tail (~5–15s observed). Today the two are equal, which guarantees a false timeout whenever the drain uses its full budget.

PR #17292 fixed an analogous bug at service_stop() (L2946):

# Before #17292
_wait_for_gateway_exit(timeout=10.0, force_after=5.0)
# After #17292
_wait_for_gateway_exit(timeout=max(_drain, 20.0), force_after=min(_drain * 0.5, 10.0))

The launchd_restart() call site wasn't updated in that PR.
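
To see what the #17292 expressions produce, here is a small sketch that evaluates them for a few drain values. stop_wait_params is a hypothetical wrapper written for this issue; only the two expressions inside it are taken from the PR snippet above.

```python
from typing import Tuple


def stop_wait_params(drain: float) -> Tuple[float, float]:
    """Return (timeout, force_after) using the expressions quoted from #17292."""
    timeout = max(drain, 20.0)
    force_after = min(drain * 0.5, 10.0)
    return timeout, force_after


# Evaluate for a short, default, and long drain budget (seconds).
for drain in (10.0, 60.0, 120.0):
    print(drain, stop_wait_params(drain))
```

For the default 60s budget the computed timeout is still 60s, so it may be worth checking whether this shape alone leaves headroom for the cleanup tail at launchd_restart().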

Possible approaches

  1. Extend the CLI deadline past the gateway deadline. Smallest fix:

    exited = _wait_for_gateway_exit(timeout=drain_timeout + 15.0, force_after=drain_timeout + 10.0)

    ~2 LOC, same file, launchd_restart() only.

  2. Apply #17292's pattern at this site. After #17292 (fix: use configured drain timeout for gateway restart wait (#17198)) merges, reuse the max(_drain, 20.0) / min(_drain * 0.5, 10.0) shape for consistency. ~2 LOC.

  3. Emit the warning only when actually force-killing. The current message conflates "still draining (fine)" with "wedged (bad)". Move/reword. ~5 LOC.

Approaches 1 and 3 are independent, and both are small.
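
A rough sketch of how approaches 1 and 3 could combine: wait for drain_timeout plus a margin, and warn only when the deadline passes with the process still alive. This is not the real _wait_for_gateway_exit. The function name, the is_running callable, the injectable clock/sleep/warn parameters, and the warning text are all assumptions made so the example is self-contained.

```python
import time
from typing import Callable


def wait_for_exit(
    is_running: Callable[[], bool],
    drain_timeout: float,
    margin: float = 15.0,  # assumption: the +15.0 from approach 1
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    warn: Callable[[str], None] = print,
) -> bool:
    """Poll until the process exits or drain_timeout + margin elapses.

    Returns True on a clean exit. Warns only when the deadline passes with
    the process still alive, i.e. when the caller is about to force-kill.
    """
    total = drain_timeout + margin
    deadline = clock() + total
    while clock() < deadline:
        if not is_running():
            return True  # clean exit: no warning at all
        sleep(poll_interval)
    # One last check so an exit during the final sleep is not misreported.
    if not is_running():
        return True
    warn(f"Gateway still running after {total:.0f}s; forcing restart")
    return False
```

With this shape a drain that finishes at 59s plus a short tail returns True silently, and the warning is reserved for a gateway that is genuinely wedged.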

Reproduction

On macOS with at least one mid-tool-call agent session active:

hermes gateway restart
hermes gateway restart
hermes gateway restart

Check gateway.log for drain took Ns with N close to but under 60 — those are the false alarms. If any go over 60, you'll also see the auto-resume tag on a session that didn't need it.
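
The log check can be scripted. The sketch below assumes the "drain took Ns, timed_out=..." text appears exactly as in the excerpt above. near_budget_drains and its 15s window are hypothetical choices for illustration, not part of hermes.

```python
import re
from typing import List

# Matches e.g. "drain took 49.36s, timed_out=False"
DRAIN_RE = re.compile(r"drain took (\d+(?:\.\d+)?)s, timed_out=(True|False)")


def near_budget_drains(log_text: str, budget_s: float = 60.0,
                       window_s: float = 15.0) -> List[float]:
    """Return durations of clean drains that ended within window_s of the budget."""
    hits = []
    for took, timed_out in DRAIN_RE.findall(log_text):
        seconds = float(took)
        if timed_out == "False" and budget_s - window_s <= seconds < budget_s:
            hits.append(seconds)
    return hits


sample = (
    "2026-05-14 16:20:49  Shutdown phase: drain done at +50.09s\n"
    "    (drain took 49.36s, timed_out=False, active_at_start=1, active_now=0)\n"
)
print(near_budget_drains(sample))
```

To run it against a real log, pass in the contents of ~/.hermes/logs/gateway.log, for example via pathlib.Path(...).expanduser().read_text().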

Questions

  1. Is this site intentionally left for a follow-up to #17292, or did the PR just miss it?
  2. Want me to PR the ~2 LOC fix once #17292 merges, or fold it into a single combined PR?
  3. The deeper "tool call won't release during drain" cause (a real 60s timeout at 2026-05-14 00:05:43): is that a separate issue, or is it covered by #14176 ([Bug]: Gateway hang on clean exit / restart race with stale PID) or #20694 (Gateway self-restart from active WhatsApp chat can self-block graceful drain and always hit drain timeout)?

Metadata

Assignees: No one assigned

Labels:
    P2: Medium, degraded but workaround exists
    comp/cli: CLI entry point, hermes_cli/, setup wizard
    comp/gateway: Gateway runner, session dispatch, delivery
    type/bug: Something isn't working
