Skip to content

fix(homeassistant): cleanup ClientSession when websocket handshake fails#29388

Open
StartupBros wants to merge 1 commit into
NousResearch:mainfrom
StartupBros:fix/homeassistant-websocket-lifecycle
Open

fix(homeassistant): cleanup ClientSession when websocket handshake fails#29388
StartupBros wants to merge 1 commit into
NousResearch:mainfrom
StartupBros:fix/homeassistant-websocket-lifecycle

Conversation

@StartupBros

Copy link
Copy Markdown

Split out from #29199 per CONTRIBUTING.md ("one logical change per PR"). #29199 now scopes to the env-var split only; this PR is the websocket connection-lifecycle fix it originally bundled.

Problem

HomeAssistantAdapter._ws_connect opens an aiohttp.ClientSession and then runs a 4-step handshake (ws_connectauth_requiredauthsubscribe_events). Any of those steps can raise — TCP reset, peer-closed-mid-handshake, JSON decode error — but the failure path didn't run _cleanup_ws(). The leaked ClientSession then bites on the next reconnect attempt: the new attempt assigns over self._session while the old one is still open, and the old one is GC'd with its TCP connection still in CLOSE_WAIT. Later requests on the new session surface as cryptic "Session is closed" or "Connector is closed" errors that don't point at the original handshake failure.

Fix

Wrap the whole handshake in try / except Exception: so that any error tears down the ClientSession (and the WebSocket if it opened) via _cleanup_ws() before re-raising. self._ws and self._session are reset to None, so the next _ws_connect attempt starts from a clean slate.

The except block re-raises after cleanup — callers (connect()) keep their own broad-except that turns this into a return-False for the reconnect ladder; cleanup is purely additive.

Test plan

New TestWsConnectLifecycle class in tests/gateway/test_homeassistant.py covers both failure points:

  • test_ws_connect_failure_cleans_up_sessionClientSession.ws_connect raises before the handshake starts; session is closed, self._session is None.
  • test_auth_handshake_failure_cleans_up_ws_and_sessionws_connect succeeds but receive_json raises mid-handshake; both the WebSocket and the ClientSession get closed.

Full file: 47 tests, all green.

A failure between ``aiohttp.ClientSession.ws_connect`` and the final
``subscribe_events`` ack used to leave ``self._session`` open. The next
reconnect would then assign over the live handle and the old session would
be garbage-collected with its TCP connection still in CLOSE_WAIT, surfacing
later as cryptic ``"Session is closed"`` or ``"Connector is closed"``
aiohttp errors on subsequent requests.

Wrap the whole handshake in try/except so that any error tears down the
``ClientSession`` (and the WebSocket if it opened) via ``_cleanup_ws()``
before re-raising. ``self._ws`` and ``self._session`` are reset to ``None``
so the next ``_ws_connect`` attempt starts from a clean slate.

Tests in ``TestWsConnectLifecycle`` cover both failure points:
- ``ws_connect`` itself raising (TCP reset before handshake starts)
- ``receive_json`` raising mid-handshake (peer closes after the connect)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants