fix(gateway): close adapter resources on failed reconnect to prevent fd leak#37053
fix(gateway): close adapter resources on failed reconnect to prevent fd leak#37053uzunkuyruk wants to merge 2 commits into
Conversation
…fd leak When a platform fails to connect and is queued for reconnect, the loop creates a fresh adapter on every retry attempt. For APIServerAdapter, __init__ opens a sqlite3.connect() (db + WAL = 2 fds). On all three failure paths the new adapter was simply dropped — never disconnected, never cleaned up — leaking 2 fds per attempt. At the 300s backoff cap that is ~12 leaked fds/hour. On a 26-hour uptime this exhausts the 2560 fd limit, causing OSError: [Errno 24] Too many open files on every new file operation, silently killing all platform connections and making the gateway non-functional. Three-part fix: - Add _dispose_failed_adapter() helper that calls adapter.disconnect() in a try/except so cleanup errors never crash the reconnect watcher - Call it in all three failure paths in the reconnect loop: non-retryable error, retryable failure, and exception during connect - Fix APIServerAdapter.disconnect() to also close the ResponseStore sqlite3 connection — previously only the aiohttp server was stopped, leaving the db connection open even on clean shutdown Fixes NousResearch#37011
…tion cleanup If _create_adapter raises before assigning the local variable, the except block's _dispose_failed_adapter(adapter) call would raise NameError. Initialize adapter=None so the cleanup path is always safe.
|
Verified all three reconnect-failure paths in
The The Also verified the Clean fix, no issues found. |
tonydwb
left a comment
There was a problem hiding this comment.
Review: APPROVE
This PR correctly fixes a real file descriptor leak in the gateway reconnect loop that causes OSError: [Errno 24] Too many open files after ~12-26 hours of uptime.
Three failure paths — all covered:
- Non-retryable error (line 6201-6202):
await self._dispose_failed_adapter(adapter, platform)called beforedel self._failed_platforms[platform] - Retryable failure (line 6210-6211):
await self._dispose_failed_adapter(adapter, platform)called before backoff update - Exception during connect (line 6245):
await self._dispose_failed_adapter(adapter, platform)in theexceptblock, guarded byadapter = Noneinitialization at line 6154
_dispose_failed_adapter helper is robust:
- Null-safe: returns immediately if
adapter is None(covers the adapter-creation-None path at line 6156 which just usescontinuebefore adapter is set) - Exception-safe: wraps
adapter.disconnect()in try/except so cleanup errors never crash the reconnect watcher - Calls the proper
adapter.disconnect()method which every platform adapter implements
APIServerAdapter.disconnect() improvement:
- Now also closes
self._response_store(sqlite3 connection), releasing 2 fds (db + WAL) that were previously leaked even on clean shutdown. This is the primary source of the leak mentioned in the root cause analysis.
Tests: The PR has no test changes (0 deletions), which is acceptable:
- The existing
test_platform_reconnect.pytests useStubAdapter.disconnect()which is a no-op async — these continue to pass with the dispose calls inserted - The fix is structurally simple: inserting a careful helper call in three well-defined locations
- The
_safe_adapter_disconnectexisting helper (used in startup failure paths) proves the codebase already trusts this disconnect-on-failure pattern - A regression test that verifies
disconnect()is called on failed reconnects would be nice-to-have but not a blocker given the simplicity and guard-rails of the helper
Nit (non-blocking): The _dispose_failed_adapter helper is quite similar to _safe_adapter_disconnect (lines 2222-2250). The main difference is _dispose_failed_adapter uses a timeout and logs at DEBUG vs the existing one which logs at WARNING on timeout. Given this is a hot reconnect loop, the lighter-weight approach makes sense — but future cleanup could consolidate them.
Verdict: APPROVE — clean, scoped, correctly targets all three failure paths with a robust dispose helper. Fixes a real production issue (fd exhaustion).
|
Thanks for this — your diagnosis was spot on. The fd leak from We shipped a parallel fix on
It references #37011 directly. Because that landed independently and #37011 is already closed, this PR is now redundant, so I'm closing it — but the credit for the correct diagnosis and approach is yours. Thanks for taking the time to write it up so clearly. |
What does this PR do?
Fixes a file descriptor leak in the platform reconnect loop that causes the gateway to exhaust the OS fd limit (~2560) after 12–26 hours of uptime, making it completely non-functional.
Root cause: When a platform fails to connect and is queued for reconnect, the loop creates a fresh adapter on every retry. For
APIServerAdapter,__init__opens asqlite3.connect()(db + WAL = 2 fds). On all three failure paths the new adapter was simply dropped — never disconnected, never cleaned up leaking 2 fds per attempt. At the 300s backoff cap that's ~12 leaked fds/hour. After ~26h the gateway hits the 2560 fd limit causingOSError: [Errno 24] Too many open fileson every file operation, silently killing all platform connections.Note: The initial connection path already calls
_safe_adapter_disconnect()on failure this PR only fixes the reconnect loop which was missing cleanup.Three-part fix:
_dispose_failed_adapter()new helper that callsadapter.disconnect()in a try/except so cleanup errors never crash the reconnect watcheradapter = Noneinitialized before the try block so exception-path cleanup never raisesNameErrorAPIServerAdapter.disconnect()now also closes theResponseStoresqlite3 connection previously only the aiohttp server was stopped, leaving the db connection open even on clean shutdownType of Change
References
Fixes #37011
Checklist