Skip to content

fix(gateway): bound Yuanbao WS close handshake to avoid 5s shutdown stall#40421

Closed
maxmilian wants to merge 1 commit into
NousResearch:mainfrom
maxmilian:fix/yuanbao-shutdown-close-timeout
Closed

fix(gateway): bound Yuanbao WS close handshake to avoid 5s shutdown stall#40421
maxmilian wants to merge 1 commit into
NousResearch:mainfrom
maxmilian:fix/yuanbao-shutdown-close-timeout

Conversation

@maxmilian

Copy link
Copy Markdown
Contributor

Summary

Fixes #40383. The Yuanbao adapter consistently added a fixed ~5s delay to gateway shutdown even when idle. This bounds the WS close handshake so a non-responsive server can no longer stall teardown.

Root cause

ConnectionManager._cleanup_ws() (gateway/platforms/yuanbao.py) awaited ws.close() unbounded. The connection is opened with close_timeout=5, so the websockets close handshake blocks up to 5s waiting for the server's close-frame echo. On an idle shutdown the server never replies, so teardown always waited the full ~5s — matching the reporter's ✓ yuanbao disconnected (5.01s) log.

Call path: gateway.run times adapter.disconnect()ConnectionManager.close() (cancels heartbeat/recv tasks — fast) → await self._cleanup_ws()await ws.close() ← the entire ~5s lives here.

Fix

Bound the close await with asyncio.wait_for(ws.close(), timeout=WS_CLOSE_TIMEOUT_S) (1.0s, new module constant near the other WS timeouts). A responsive server completes the handshake in well under a second, so this only caps the pathological hang; the graceful close is still attempted on the fast path. The reconnect paths that reuse _cleanup_ws() benefit too. asyncio.TimeoutError is swallowed alongside the pre-existing except Exception (it is a subclass), dropping a dead connection rather than stalling.

Tests

New tests/test_yuanbao_shutdown.py (3 cases):

  • hung server (close never echoes) — teardown stays bounded (~1.0s, was ~5.0s)
  • responsive server — fast path returns immediately
  • close() raises — ws reference still cleared

scripts/run_tests.sh over all 6 Yuanbao test files: 220 passed. The new file alone went 5.13s → 1.2s once bounded. ruff + scripts/check-windows-footguns.py clean on both changed files.

Scope

Deliberately unchanged: the close_timeout=5 on the two websockets.connect() sites (it governs the library's own handshake cap and is now superseded by the shorter outer bound during teardown); the heartbeat/recv-task cancellation in ConnectionManager.close() (already fast). No behavior change for a server that closes promptly.

@maxmilian maxmilian force-pushed the fix/yuanbao-shutdown-close-timeout branch from ba73467 to 86b2542 Compare June 6, 2026 10:13
@maxmilian

Copy link
Copy Markdown
Contributor Author

Note on the reconnect paths: _cleanup_ws() is also called from the connect-failure / reconnect routines, not just shutdown. The 1s bound is a strict improvement there too — before this change each of those cleanups could itself stall up to the full close_timeout=5 against an unresponsive server, so a flapping reconnect got faster, not slower. The fast path (server echoes the close frame) is unchanged. A timed-out close is now logged at debug to make any future shutdown-hang easy to spot.

…tall

The Yuanbao adapter's ConnectionManager._cleanup_ws() awaited ws.close()
unbounded. The websockets connection is opened with close_timeout=5, so the
close handshake blocks up to 5s waiting for the server's close-frame echo. On
an idle gateway shutdown the server never replies, making teardown consistently
take ~5s (NousResearch#40383).

Bound the close await with asyncio.wait_for(WS_CLOSE_TIMEOUT_S=1.0): a
responsive server completes the handshake in well under a second, so this only
caps the pathological hang. The reconnect/connect-failure paths that reuse
_cleanup_ws() benefit too (each previously could stall up to 5s, now <=1s). A
timed-out close is logged at debug to aid future shutdown-hang diagnosis.

Adds tests/test_yuanbao_shutdown.py: hung-server bound, fast-path, and
close()-raises cases. The hung-server repro went from 5.0s to ~1.0s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@maxmilian maxmilian force-pushed the fix/yuanbao-shutdown-close-timeout branch from 86b2542 to ca0e06a Compare June 6, 2026 10:17
@maxmilian maxmilian marked this pull request as ready for review June 6, 2026 10:27
@alt-glitch alt-glitch added type/perf Performance improvement or optimization P3 Low — cosmetic, nice to have comp/gateway Gateway runner, session dispatch, delivery labels Jun 6, 2026
@teknium1

teknium1 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Salvaged into #40607 with your authorship credited (Co-authored-by). Re-verified on current main, tightened, and tested. Thanks!

#40607

@teknium1 teknium1 closed this Jun 6, 2026
teknium1 added a commit that referenced this pull request Jun 8, 2026
… ~5s (#40607)

Salvaged from #40421; re-verified on main, tightened, tested.

Co-authored-by: maxmilian <maxmilian@users.noreply.github.com>
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
… ~5s (NousResearch#40607)

Salvaged from NousResearch#40421; re-verified on main, tightened, tested.

Co-authored-by: maxmilian <maxmilian@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have type/perf Performance improvement or optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Yuanbao adapter adds a fixed ~5 second delay during gateway shutdown

3 participants