fix: use configured drain timeout for gateway restart wait (#17198) #17292

Open
vominh1919 wants to merge 1 commit into NousResearch:main from vominh1919:fix/gateway-restart-drain-timeout

Conversation

@vominh1919

Contributor

Problem

The hermes gateway restart command hardcoded a 10-second timeout for _wait_for_gateway_exit(), but platform adapters like Weixin can take 18+ seconds to release their tokens during graceful shutdown. This causes a race condition where the new gateway fails to start because the old process's platform lock is still held.

Fix

The restart command now uses the configured restart_drain_timeout (default 60s) as the wait timeout, with a floor of 20s, and uses 50% of the drain timeout (capped at 10s) as the force-kill threshold. This matches the existing drain timeout mechanism used by systemd/launchd restarts.

Before:

_wait_for_gateway_exit(timeout=10.0, force_after=5.0)

After:

_drain = _get_restart_drain_timeout()
_wait_for_gateway_exit(timeout=max(_drain, 20.0), force_after=min(_drain * 0.5, 10.0))
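To make the clamping concrete, here is a minimal sketch that evaluates the same two expressions for a few drain timeouts. The helper name `restart_wait_params` is hypothetical and exists only for illustration. The PR itself calls `_get_restart_drain_timeout()` and `_wait_for_gateway_exit()` directly, and neither is reimplemented here.

```python
def restart_wait_params(drain_timeout: float) -> tuple[float, float]:
    """Return (timeout, force_after) in seconds.

    Hypothetical helper: it only mirrors the two expressions from the diff
    and is not part of the hermes codebase.
    """
    timeout = max(drain_timeout, 20.0)            # wait at least 20s
    force_after = min(drain_timeout * 0.5, 10.0)  # half the drain timeout, at most 10s
    return timeout, force_after


for drain in (60.0, 30.0, 8.0):
    timeout, force_after = restart_wait_params(drain)
    print(f"drain={drain}s -> timeout={timeout}s, force_after={force_after}s")
# drain=60.0s -> timeout=60.0s, force_after=10.0s
# drain=30.0s -> timeout=30.0s, force_after=10.0s
# drain=8.0s -> timeout=20.0s, force_after=4.0s
```

With the default 60s drain timeout, the wait is 60s and the force-kill threshold is 10s. The 20s floor applies only when the configured drain timeout is below 20s, and in that range the force-kill threshold also drops below 10s.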

Before vs After

| Scenario | Before | After |
| --- | --- | --- |
| Weixin disconnect takes 18s | Timeout at 10s, new gateway fails | Waits up to 60s by default (configurable), new gateway succeeds |
| Quick disconnect (2s) | Works (10s > 2s) | Works (timeout is at least 20s) |
| Process hung (needs force-kill) | SIGKILL at 5s | SIGKILL at 10s (or at 50% of the drain timeout, if that is lower) |

Fixes #17198

The gateway restart command hardcoded a 10-second timeout for
_wait_for_gateway_exit(), but platform adapters like Weixin can
take 18+ seconds to release their tokens during graceful shutdown.

Now uses the configured restart_drain_timeout (default 60s) with
a minimum of 20s, matching the existing drain timeout mechanism
used by systemd/launchd restarts.

Fixes NousResearch#17198
Labels

comp/cli: CLI entry point, hermes_cli/, setup wizard
comp/gateway: Gateway runner, session dispatch, delivery
P2: Medium (degraded but workaround exists)
type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

gateway restart: race condition causes Weixin token conflict

2 participants