Skip to content

fix(gateway): bound adapter network awaits on the shutdown path#36913

Open
banditburai wants to merge 5 commits into
NousResearch:mainfrom
banditburai:fix/gateway-stale-pid-restart
Open

fix(gateway): bound adapter network awaits on the shutdown path#36913
banditburai wants to merge 5 commits into
NousResearch:mainfrom
banditburai:fix/gateway-stale-pid-restart

Conversation

@banditburai

Copy link
Copy Markdown
Contributor

Summary

On the gateway shutdown path (GatewayRunner._stop_impl), adapter network calls — disconnect(), cancel_background_tasks(), and the shutdown-notification send()s — were awaited with no timeout while the runtime lock was held. A half-dead platform (TCP black-hole, wedged WebSocket) makes one of these awaits block indefinitely, so the process never exits; launchd/systemd then SIGKILLs it mid-shutdown, leaving a stale gateway.pid / gateway.lock that the next start treats as a live instance.

This change wraps each of those awaits in asyncio.wait_for at the call site, using the existing HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT knob (_ADAPTER_DISCONNECT_TIMEOUT_SECS_DEFAULT = 5.0, gateway/run.py:67). No new env var or constant.

Change

gateway/run.py only (helpers + call sites):

  • _safe_adapter_teardown (gateway/run.py:2253) bounds disconnect() + cancel_background_tasks() per adapter. Called from _teardown_adapters (gateway/run.py:2298), which replaces the former sequential per-adapter loop and runs the teardowns under one asyncio.gather(return_exceptions=True) — call site gateway/run.py:6541.
  • _bounded_shutdown_send (gateway/run.py:2316) bounds the three shutdown-notification sends (gateway/run.py:3752, :3807, :3809). On timeout it logs a WARNING and returns None; None is the existing adapter contract for a send with no structured result, so the callers' result is not None check is unchanged and the target is added to the notified dedup set.
  • _safe_adapter_disconnect (gateway/run.py:2222, pre-existing) now bounds its disconnect() and is used on the fatal-error path inside the existing try/finally at gateway/run.py:2778.

Each helper takes a bare await (no wait_for) when the resolved timeout is <= 0, preserving an opt-out. The helpers catch Exception and asyncio.TimeoutError, not CancelledError (a BaseException on Python ≥3.11), so cancellation still propagates; the try/finally and gather(return_exceptions=True) perform cleanup in that case.

Unchanged: gateway/status.py (PID/lock machinery); _db.close() (gateway/run.py:6598) and shutdown_cached_clients() (gateway/run.py:6584), which are synchronous (see Note); _launch_detached_restart_command() (fire-and-forget subprocess.Popen); the agent drain, which is bounded by its own deadline argument.

Behavior

Call site Before After
teardown loop (run.py:6541) sequential per-adapter disconnect() + cancel_background_tasks(), each unbounded concurrent gather; each op bounded by the timeout; phase ≈ max(timeout) rather than the per-adapter sum
notify sends (run.py:3752/:3807/:3809) unbounded send(); a wedged socket blocks _stop_impl bounded send(); on timeout returns None, caller logs and continues to the next target
fatal-error disconnect (run.py:2778) unbounded disconnect(); if it raised, the exception left the fire-and-forget task before the if adapter.fatal_error_retryable: block below the try/finally, so a retryable platform was not queued for reconnect bounded disconnect(), error swallowed; the reconnection-queueing block is always reached
any bounded op, timeout <= 0 n/a bare await, unbounded (opt-out)

Testing

uv run --extra dev pytest <the 13 tests/gateway files exercising the changed methods> -p no:xdist → 227 passed. The 228th, test_gateway_shutdown.py::test_gateway_stop_systemd_service_restart_exits_cleanly, asserts the Linux systemd exit code (0) but does not patch sys.platform, so it only passes on a Linux host (forcing sys.platform="linux" makes it pass; the macOS branch — exit 75 — is covered by its sibling test_gateway_stop_launchd_service_restart_keeps_nonzero_exit). It reproduces on unmodified main and is in the service-restart exit-code path, which this change does not touch. Run xdist-isolated because two further unrelated areas pollute across workers in the full tests/gateway/ run (telegram model-picker/slash-confirm; wecom-callback, which also needs the optional defusedxml dep) — those pass when each file runs alone.

New tests/gateway/test_shutdown_disconnect_timeout.py (7 tests):

  • test_hung_disconnect_does_not_stall_teardown_safe_adapter_teardown returns within 2s when disconnect() never completes.
  • test_hung_notify_send_is_bounded_bounded_shutdown_send returns None on a never-completing send().
  • test_fatal_error_disconnect_is_bounded_handle_adapter_fatal_error still pops the dead adapter from self.adapters when disconnect() hangs.
  • test_teardown_runs_adapters_concurrently — two hanging adapters at a 0.2s timeout finish in 0.18 < elapsed < 0.38 (lower bound: the timeout fired; upper bound: gather ran them concurrently, not the ~0.4s of a sequential loop).
  • test_zero_timeout_opts_out_bounded_send / _teardown / _safe_disconnect — with the env set to "0", a 0.05s op runs to completion / returns its real value, confirming the <= 0 branch takes a bare await rather than wait_for(..., timeout=0).

Modified tests/gateway/test_shutdown_cache_cleanup.py — its _FakeGateway drives the real _stop_impl, so it gains a one-line _teardown_adapters no-op to match the extracted method.

Verification is unit-level: the timeout bounding is exercised directly; the end-to-end sequence (wedged socket → SIGKILL → stale file) is not reproduced in CI.

Note

_db.close() (gateway/run.py:6598) and shutdown_cached_clients() (gateway/run.py:6584) remain on the shutdown path as synchronous, unbounded calls; asyncio.wait_for cannot interrupt a synchronous call, and running _db.close() in a thread and abandoning it on timeout would leave the SQLite WAL lock held — the lock that the close exists to release. These are local, process-owned operations. A future change could quiesce writers / checkpoint the WAL before close if a stall is ever observed there.

Related

Fixes #14128
Addresses #14176
Refs #13511, #13655, #14130, #25569, #28561, #28603

The clean-shutdown teardown loop awaited adapter.disconnect() and
cancel_background_tasks() with no timeout, so a single wedged adapter
(commonly a stuck Feishu/Telegram WebSocket) hung shutdown until systemd
TimeoutStopSec / launchd escalated to SIGKILL — dropping in-flight work
and leaving stale gateway.pid / gateway.lock behind, surfacing as the
'PID file race lost' / restart-failure reports.

Route per-adapter teardown (cancel + disconnect) through a new bounded,
never-raising _safe_adapter_teardown helper using the existing 5s,
env-tunable HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT. Teardown stays
sequential in this commit. Note: for thread-backed adapters the timeout
bounds teardown by abandoning the worker thread (reclaimed at process
exit), not by gracefully closing the socket.

Fixes NousResearch#14128
Fixes NousResearch#14176
_notify_active_sessions_of_shutdown() runs first in _stop_impl with the
runtime lock held and issued three unbounded adapter.send() calls. A send
wedged on a half-dead platform hung shutdown before adapter teardown was
reached — the same stale-pid/lock + SIGKILL failure mode as the disconnect
hang. Bound all three via _bounded_shutdown_send using the shared
per-adapter shutdown timeout.

Refs NousResearch#14128
_handle_adapter_fatal_error awaited adapter.disconnect() with no timeout.
The common trigger is a network failure (e.g. Telegram polling errors
exhausted), which is exactly when the adapter's disconnect can wedge —
hanging before the background-reconnection enqueue and leaving the
platform silently never-reconnected. Route it through the existing
bounded _safe_adapter_disconnect helper (now the shared bounded-disconnect
resolver). The try/finally stays so a teardown-time cancellation still
drops the dead adapter.

Refs NousResearch#14128
With per-adapter disconnect now bounded, run the teardowns concurrently
via asyncio.gather(return_exceptions=True) so the phase is ~max(timeout)
instead of N×timeout. Closes the macOS N>=4-adapter tail where sequential
teardown could exceed launchd's fixed shutdown grace. Adapters are
independent; shared cleanup runs after the gather. Extracted into
_teardown_adapters so the concurrency is unit-testable and reverts as one
hunk.

Refs NousResearch#14128
…rency bound

Add behavioral tests that the timeout<=0 env opt-out takes a bare await
(not wait_for(..., timeout=0)) for _bounded_shutdown_send,
_safe_adapter_teardown, and _safe_adapter_disconnect. Tighten the
concurrent-teardown timing assertion to 0.18 < elapsed < 0.38 so it also
proves the per-adapter timeout fired, not just that gather ran.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(gateway): Gateway shutdown hangs causing 'PID file race lost' on restart

2 participants