fix(gateway): bound adapter network awaits on the shutdown path#36913
Open
banditburai wants to merge 5 commits into
Open
fix(gateway): bound adapter network awaits on the shutdown path#36913banditburai wants to merge 5 commits into
banditburai wants to merge 5 commits into
Conversation
The clean-shutdown teardown loop awaited adapter.disconnect() and cancel_background_tasks() with no timeout, so a single wedged adapter (commonly a stuck Feishu/Telegram WebSocket) hung shutdown until systemd TimeoutStopSec / launchd escalated to SIGKILL — dropping in-flight work and leaving stale gateway.pid / gateway.lock behind, surfacing as the 'PID file race lost' / restart-failure reports. Route per-adapter teardown (cancel + disconnect) through a new bounded, never-raising _safe_adapter_teardown helper using the existing 5s, env-tunable HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT. Teardown stays sequential in this commit. Note: for thread-backed adapters the timeout bounds teardown by abandoning the worker thread (reclaimed at process exit), not by gracefully closing the socket. Fixes NousResearch#14128 Fixes NousResearch#14176
_notify_active_sessions_of_shutdown() runs first in _stop_impl with the runtime lock held and issued three unbounded adapter.send() calls. A send wedged on a half-dead platform hung shutdown before adapter teardown was reached — the same stale-pid/lock + SIGKILL failure mode as the disconnect hang. Bound all three via _bounded_shutdown_send using the shared per-adapter shutdown timeout. Refs NousResearch#14128
_handle_adapter_fatal_error awaited adapter.disconnect() with no timeout. The common trigger is a network failure (e.g. Telegram polling errors exhausted), which is exactly when the adapter's disconnect can wedge — hanging before the background-reconnection enqueue and leaving the platform silently never-reconnected. Route it through the existing bounded _safe_adapter_disconnect helper (now the shared bounded-disconnect resolver). The try/finally stays so a teardown-time cancellation still drops the dead adapter. Refs NousResearch#14128
With per-adapter disconnect now bounded, run the teardowns concurrently via asyncio.gather(return_exceptions=True) so the phase is ~max(timeout) instead of N×timeout. Closes the macOS N>=4-adapter tail where sequential teardown could exceed launchd's fixed shutdown grace. Adapters are independent; shared cleanup runs after the gather. Extracted into _teardown_adapters so the concurrency is unit-testable and reverts as one hunk. Refs NousResearch#14128
…rency bound Add behavioral tests that the timeout<=0 env opt-out takes a bare await (not wait_for(..., timeout=0)) for _bounded_shutdown_send, _safe_adapter_teardown, and _safe_adapter_disconnect. Tighten the concurrent-teardown timing assertion to 0.18 < elapsed < 0.38 so it also proves the per-adapter timeout fired, not just that gather ran.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
On the gateway shutdown path (
GatewayRunner._stop_impl), adapter network calls —disconnect(),cancel_background_tasks(), and the shutdown-notificationsend()s — were awaited with no timeout while the runtime lock was held. A half-dead platform (TCP black-hole, wedged WebSocket) makes one of these awaits block indefinitely, so the process never exits; launchd/systemd then SIGKILLs it mid-shutdown, leaving a stalegateway.pid/gateway.lockthat the next start treats as a live instance.This change wraps each of those awaits in
asyncio.wait_forat the call site, using the existingHERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUTknob (_ADAPTER_DISCONNECT_TIMEOUT_SECS_DEFAULT = 5.0,gateway/run.py:67). No new env var or constant.Change
gateway/run.pyonly (helpers + call sites):_safe_adapter_teardown(gateway/run.py:2253) boundsdisconnect()+cancel_background_tasks()per adapter. Called from_teardown_adapters(gateway/run.py:2298), which replaces the former sequential per-adapter loop and runs the teardowns under oneasyncio.gather(return_exceptions=True)— call sitegateway/run.py:6541._bounded_shutdown_send(gateway/run.py:2316) bounds the three shutdown-notification sends (gateway/run.py:3752,:3807,:3809). On timeout it logs aWARNINGand returnsNone;Noneis the existing adapter contract for a send with no structured result, so the callers'result is not Nonecheck is unchanged and the target is added to thenotifieddedup set._safe_adapter_disconnect(gateway/run.py:2222, pre-existing) now bounds itsdisconnect()and is used on the fatal-error path inside the existingtry/finallyatgateway/run.py:2778.Each helper takes a bare
await(nowait_for) when the resolved timeout is<= 0, preserving an opt-out. The helpers catchExceptionandasyncio.TimeoutError, notCancelledError(aBaseExceptionon Python ≥3.11), so cancellation still propagates; thetry/finallyandgather(return_exceptions=True)perform cleanup in that case.Unchanged:
gateway/status.py(PID/lock machinery);_db.close()(gateway/run.py:6598) andshutdown_cached_clients()(gateway/run.py:6584), which are synchronous (see Note);_launch_detached_restart_command()(fire-and-forgetsubprocess.Popen); the agent drain, which is bounded by its own deadline argument.Behavior
run.py:6541)disconnect()+cancel_background_tasks(), each unboundedgather; each op bounded by the timeout; phase ≈max(timeout)rather than the per-adapter sumrun.py:3752/:3807/:3809)send(); a wedged socket blocks_stop_implsend(); on timeout returnsNone, caller logs and continues to the next targetrun.py:2778)disconnect(); if it raised, the exception left the fire-and-forget task before theif adapter.fatal_error_retryable:block below thetry/finally, so a retryable platform was not queued for reconnectdisconnect(), error swallowed; the reconnection-queueing block is always reachedtimeout <= 0await, unbounded (opt-out)Testing
uv run --extra dev pytest <the 13 tests/gateway files exercising the changed methods> -p no:xdist→ 227 passed. The 228th,test_gateway_shutdown.py::test_gateway_stop_systemd_service_restart_exits_cleanly, asserts the Linux systemd exit code (0) but does not patchsys.platform, so it only passes on a Linux host (forcingsys.platform="linux"makes it pass; the macOS branch — exit75— is covered by its siblingtest_gateway_stop_launchd_service_restart_keeps_nonzero_exit). It reproduces on unmodifiedmainand is in the service-restart exit-code path, which this change does not touch. Run xdist-isolated because two further unrelated areas pollute across workers in the fulltests/gateway/run (telegram model-picker/slash-confirm; wecom-callback, which also needs the optionaldefusedxmldep) — those pass when each file runs alone.New
tests/gateway/test_shutdown_disconnect_timeout.py(7 tests):test_hung_disconnect_does_not_stall_teardown—_safe_adapter_teardownreturns within 2s whendisconnect()never completes.test_hung_notify_send_is_bounded—_bounded_shutdown_sendreturnsNoneon a never-completingsend().test_fatal_error_disconnect_is_bounded—_handle_adapter_fatal_errorstill pops the dead adapter fromself.adapterswhendisconnect()hangs.test_teardown_runs_adapters_concurrently— two hanging adapters at a 0.2s timeout finish in0.18 < elapsed < 0.38(lower bound: the timeout fired; upper bound:gatherran them concurrently, not the ~0.4s of a sequential loop).test_zero_timeout_opts_out_bounded_send/_teardown/_safe_disconnect— with the env set to"0", a 0.05s op runs to completion / returns its real value, confirming the<= 0branch takes a bareawaitrather thanwait_for(..., timeout=0).Modified
tests/gateway/test_shutdown_cache_cleanup.py— its_FakeGatewaydrives the real_stop_impl, so it gains a one-line_teardown_adaptersno-op to match the extracted method.Verification is unit-level: the timeout bounding is exercised directly; the end-to-end sequence (wedged socket → SIGKILL → stale file) is not reproduced in CI.
Note
_db.close()(gateway/run.py:6598) andshutdown_cached_clients()(gateway/run.py:6584) remain on the shutdown path as synchronous, unbounded calls;asyncio.wait_forcannot interrupt a synchronous call, and running_db.close()in a thread and abandoning it on timeout would leave the SQLite WAL lock held — the lock that the close exists to release. These are local, process-owned operations. A future change could quiesce writers / checkpoint the WAL before close if a stall is ever observed there.Related
adapter.disconnect()in the shutdown loop, with a hardcoded interval. This change covers that call pluscancel_background_tasks(), the notify sends, and the fatal-error disconnect, and reads the interval from the existing env knob.gateway/status.py).mainalready performs a stale-PID pre-clean before theO_EXCLwrite (gateway/status.py:286_cleanup_invalid_pid_path,:999get_running_pid) and replacedos.kill(pid, 0)with_pid_existsfor Windows correctness in324567c93.start_gatewayshould verify lock-holder PID is alive before treating stale lock as "another instance" #28561 — the stale-file reclaim these describe is handled onmain: the advisory lock auto-releases on holder death (fcntl.flock/msvcrt.locking,gateway/status.py:327-329), andget_running_pid()unlinks the stalegateway.pidandgateway.lockbefore the next start acquires the lock.Fixes #14128
Addresses #14176
Refs #13511, #13655, #14130, #25569, #28561, #28603