fix(gateway): bound adapter network awaits on the shutdown path by banditburai · Pull Request #36913 · NousResearch/hermes-agent

banditburai · 2026-06-01T17:14:42Z

Summary

On the gateway shutdown path (GatewayRunner._stop_impl), adapter network calls — disconnect(), cancel_background_tasks(), and the shutdown-notification send()s — were awaited with no timeout while the runtime lock was held. A half-dead platform (TCP black-hole, wedged WebSocket) makes one of these awaits block indefinitely, so the process never exits; launchd/systemd then SIGKILLs it mid-shutdown, leaving a stale gateway.pid / gateway.lock that the next start treats as a live instance.

This change wraps each of those awaits in asyncio.wait_for at the call site, using the existing HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT knob (_ADAPTER_DISCONNECT_TIMEOUT_SECS_DEFAULT = 5.0, gateway/run.py:67). No new env var or constant.

Change

gateway/run.py only (helpers + call sites):

_safe_adapter_teardown (gateway/run.py:2253) bounds disconnect() + cancel_background_tasks() per adapter. Called from _teardown_adapters (gateway/run.py:2298), which replaces the former sequential per-adapter loop and runs the teardowns under one asyncio.gather(return_exceptions=True) — call site gateway/run.py:6541.
_bounded_shutdown_send (gateway/run.py:2316) bounds the three shutdown-notification sends (gateway/run.py:3752, :3807, :3809). On timeout it logs a WARNING and returns None; None is the existing adapter contract for a send with no structured result, so the callers' result is not None check is unchanged and the target is added to the notified dedup set.
_safe_adapter_disconnect (gateway/run.py:2222, pre-existing) now bounds its disconnect() and is used on the fatal-error path inside the existing try/finally at gateway/run.py:2778.

Each helper takes a bare await (no wait_for) when the resolved timeout is <= 0, preserving an opt-out. The helpers catch Exception and asyncio.TimeoutError, not CancelledError (a BaseException on Python ≥3.11), so cancellation still propagates; the try/finally and gather(return_exceptions=True) perform cleanup in that case.

Unchanged: gateway/status.py (PID/lock machinery); _db.close() (gateway/run.py:6598) and shutdown_cached_clients() (gateway/run.py:6584), which are synchronous (see Note); _launch_detached_restart_command() (fire-and-forget subprocess.Popen); the agent drain, which is bounded by its own deadline argument.

Behavior

Call site	Before	After
teardown loop (`run.py:6541`)	sequential per-adapter `disconnect()` + `cancel_background_tasks()`, each unbounded	concurrent `gather`; each op bounded by the timeout; phase ≈ `max(timeout)` rather than the per-adapter sum
notify sends (`run.py:3752/:3807/:3809`)	unbounded `send()`; a wedged socket blocks `_stop_impl`	bounded `send()`; on timeout returns `None`, caller logs and continues to the next target
fatal-error disconnect (`run.py:2778`)	unbounded `disconnect()`; if it raised, the exception left the fire-and-forget task before the `if adapter.fatal_error_retryable:` block below the `try/finally`, so a retryable platform was not queued for reconnect	bounded `disconnect()`, error swallowed; the reconnection-queueing block is always reached
any bounded op, `timeout <= 0`	n/a	bare `await`, unbounded (opt-out)

Testing

uv run --extra dev pytest <the 13 tests/gateway files exercising the changed methods> -p no:xdist → 227 passed. The 228th, test_gateway_shutdown.py::test_gateway_stop_systemd_service_restart_exits_cleanly, asserts the Linux systemd exit code (0) but does not patch sys.platform, so it only passes on a Linux host (forcing sys.platform="linux" makes it pass; the macOS branch — exit 75 — is covered by its sibling test_gateway_stop_launchd_service_restart_keeps_nonzero_exit). It reproduces on unmodified main and is in the service-restart exit-code path, which this change does not touch. Run xdist-isolated because two further unrelated areas pollute across workers in the full tests/gateway/ run (telegram model-picker/slash-confirm; wecom-callback, which also needs the optional defusedxml dep) — those pass when each file runs alone.

New tests/gateway/test_shutdown_disconnect_timeout.py (7 tests):

test_hung_disconnect_does_not_stall_teardown — _safe_adapter_teardown returns within 2s when disconnect() never completes.
test_hung_notify_send_is_bounded — _bounded_shutdown_send returns None on a never-completing send().
test_fatal_error_disconnect_is_bounded — _handle_adapter_fatal_error still pops the dead adapter from self.adapters when disconnect() hangs.
test_teardown_runs_adapters_concurrently — two hanging adapters at a 0.2s timeout finish in 0.18 < elapsed < 0.38 (lower bound: the timeout fired; upper bound: gather ran them concurrently, not the ~0.4s of a sequential loop).
test_zero_timeout_opts_out_bounded_send / _teardown / _safe_disconnect — with the env set to "0", a 0.05s op runs to completion / returns its real value, confirming the <= 0 branch takes a bare await rather than wait_for(..., timeout=0).

Modified tests/gateway/test_shutdown_cache_cleanup.py — its _FakeGateway drives the real _stop_impl, so it gains a one-line _teardown_adapters no-op to match the extracted method.

Verification is unit-level: the timeout bounding is exercised directly; the end-to-end sequence (wedged socket → SIGKILL → stale file) is not reproduced in CI.

Note

_db.close() (gateway/run.py:6598) and shutdown_cached_clients() (gateway/run.py:6584) remain on the shutdown path as synchronous, unbounded calls; asyncio.wait_for cannot interrupt a synchronous call, and running _db.close() in a thread and abandoning it on timeout would leave the SQLite WAL lock held — the lock that the close exists to release. These are local, process-owned operations. A future change could quiesce writers / checkpoint the WAL before close if a stall is ever observed there.

fix(gateway): add timeout to adapter.disconnect() during shutdown #14130 — earlier approach: a timeout around adapter.disconnect() in the shutdown loop, with a hardcoded interval. This change covers that call plus cancel_background_tasks(), the notify sends, and the fatal-error disconnect, and reads the interval from the existing env knob.
fix(gateway): clean stale PID file before O_EXCL write #25569 — addresses the same symptom at the PID layer (gateway/status.py). main already performs a stale-PID pre-clean before the O_EXCL write (gateway/status.py:286 _cleanup_invalid_pid_path, :999 get_running_pid) and replaced os.kill(pid, 0) with _pid_exists for Windows correctness in 324567c93.
fix(gateway): explain stale runtime lock failures (#28561) #28603 — a diagnostic-message change for stale-lock startup failures, in a region this change does not touch.
[Bug]: Hermes Gateway PID 文件竞态问题及修复方案 #13511, Bug: Stale gateway.pid causes gateway restart loop after crash/SIGKILL #13655, # Issue: start_gateway should verify lock-holder PID is alive before treating stale lock as "another instance" #28561 — the stale-file reclaim these describe is handled on main: the advisory lock auto-releases on holder death (fcntl.flock / msvcrt.locking, gateway/status.py:327-329), and get_running_pid() unlinks the stale gateway.pid and gateway.lock before the next start acquires the lock.

Fixes #14128
Addresses #14176
Refs #13511, #13655, #14130, #25569, #28561, #28603

The clean-shutdown teardown loop awaited adapter.disconnect() and cancel_background_tasks() with no timeout, so a single wedged adapter (commonly a stuck Feishu/Telegram WebSocket) hung shutdown until systemd TimeoutStopSec / launchd escalated to SIGKILL — dropping in-flight work and leaving stale gateway.pid / gateway.lock behind, surfacing as the 'PID file race lost' / restart-failure reports. Route per-adapter teardown (cancel + disconnect) through a new bounded, never-raising _safe_adapter_teardown helper using the existing 5s, env-tunable HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT. Teardown stays sequential in this commit. Note: for thread-backed adapters the timeout bounds teardown by abandoning the worker thread (reclaimed at process exit), not by gracefully closing the socket. Fixes NousResearch#14128 Fixes NousResearch#14176

_notify_active_sessions_of_shutdown() runs first in _stop_impl with the runtime lock held and issued three unbounded adapter.send() calls. A send wedged on a half-dead platform hung shutdown before adapter teardown was reached — the same stale-pid/lock + SIGKILL failure mode as the disconnect hang. Bound all three via _bounded_shutdown_send using the shared per-adapter shutdown timeout. Refs NousResearch#14128

_handle_adapter_fatal_error awaited adapter.disconnect() with no timeout. The common trigger is a network failure (e.g. Telegram polling errors exhausted), which is exactly when the adapter's disconnect can wedge — hanging before the background-reconnection enqueue and leaving the platform silently never-reconnected. Route it through the existing bounded _safe_adapter_disconnect helper (now the shared bounded-disconnect resolver). The try/finally stays so a teardown-time cancellation still drops the dead adapter. Refs NousResearch#14128

With per-adapter disconnect now bounded, run the teardowns concurrently via asyncio.gather(return_exceptions=True) so the phase is ~max(timeout) instead of N×timeout. Closes the macOS N>=4-adapter tail where sequential teardown could exceed launchd's fixed shutdown grace. Adapters are independent; shared cleanup runs after the gather. Extracted into _teardown_adapters so the concurrency is unit-testable and reverts as one hunk. Refs NousResearch#14128

…rency bound Add behavioral tests that the timeout<=0 env opt-out takes a bare await (not wait_for(..., timeout=0)) for _bounded_shutdown_send, _safe_adapter_teardown, and _safe_adapter_disconnect. Tighten the concurrent-teardown timing assertion to 0.18 < elapsed < 0.38 so it also proves the per-adapter timeout fired, not just that gather ran.

banditburai added 5 commits June 1, 2026 12:56

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Jun 1, 2026

chazmaniandinkle mentioned this pull request Jun 11, 2026

fix(gateway): flush in-flight sends before cancelling on planned restart #43903

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): bound adapter network awaits on the shutdown path#36913

fix(gateway): bound adapter network awaits on the shutdown path#36913
banditburai wants to merge 5 commits into
NousResearch:mainfrom
banditburai:fix/gateway-stale-pid-restart

banditburai commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

banditburai commented Jun 1, 2026

Summary

Change

Behavior

Testing

Note

Related

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants