Skip to content

fix(gateway): close adapter resources on failed reconnect to prevent fd leak#37053

Closed
uzunkuyruk wants to merge 2 commits into
NousResearch:mainfrom
uzunkuyruk:fix/gateway-reconnect-fd-leak
Closed

fix(gateway): close adapter resources on failed reconnect to prevent fd leak#37053
uzunkuyruk wants to merge 2 commits into
NousResearch:mainfrom
uzunkuyruk:fix/gateway-reconnect-fd-leak

Conversation

@uzunkuyruk

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes a file descriptor leak in the platform reconnect loop that causes the gateway to exhaust the OS fd limit (~2560) after 12–26 hours of uptime, making it completely non-functional.

Root cause: When a platform fails to connect and is queued for reconnect, the loop creates a fresh adapter on every retry. For APIServerAdapter, __init__ opens a sqlite3.connect() (db + WAL = 2 fds). On all three failure paths the new adapter was simply dropped — never disconnected, never cleaned up leaking 2 fds per attempt. At the 300s backoff cap that's ~12 leaked fds/hour. After ~26h the gateway hits the 2560 fd limit causing OSError: [Errno 24] Too many open files on every file operation, silently killing all platform connections.

Note: The initial connection path already calls _safe_adapter_disconnect() on failure this PR only fixes the reconnect loop which was missing cleanup.

Three-part fix:

  • _dispose_failed_adapter() new helper that calls adapter.disconnect() in a try/except so cleanup errors never crash the reconnect watcher
  • Called in all three failure paths in the reconnect loop: non-retryable error, retryable failure, and exception during connect
  • adapter = None initialized before the try block so exception-path cleanup never raises NameError
  • APIServerAdapter.disconnect() now also closes the ResponseStore sqlite3 connection previously only the aiohttp server was stopped, leaving the db connection open even on clean shutdown

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

References

Fixes #37011

Checklist

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • My PR contains only changes related to this fix

…fd leak

When a platform fails to connect and is queued for reconnect, the loop
creates a fresh adapter on every retry attempt. For APIServerAdapter,
__init__ opens a sqlite3.connect() (db + WAL = 2 fds). On all three
failure paths the new adapter was simply dropped — never disconnected,
never cleaned up — leaking 2 fds per attempt.

At the 300s backoff cap that is ~12 leaked fds/hour. On a 26-hour
uptime this exhausts the 2560 fd limit, causing OSError: [Errno 24]
Too many open files on every new file operation, silently killing all
platform connections and making the gateway non-functional.

Three-part fix:
- Add _dispose_failed_adapter() helper that calls adapter.disconnect()
  in a try/except so cleanup errors never crash the reconnect watcher
- Call it in all three failure paths in the reconnect loop:
  non-retryable error, retryable failure, and exception during connect
- Fix APIServerAdapter.disconnect() to also close the ResponseStore
  sqlite3 connection — previously only the aiohttp server was stopped,
  leaving the db connection open even on clean shutdown

Fixes NousResearch#37011
…tion cleanup

If _create_adapter raises before assigning the local variable, the
except block's _dispose_failed_adapter(adapter) call would raise
NameError. Initialize adapter=None so the cleanup path is always safe.
@liuhao1024

Copy link
Copy Markdown
Contributor

Verified all three reconnect-failure paths in _platform_reconnect_watcher now call _dispose_failed_adapter:

  1. Non-retryable fatal error (line ~6201): adapter created, fatal_error_message set → dispose before del self._failed_platforms[platform]
  2. Retryable failure (line ~6210): adapter created, not fatal → dispose before updating runtime status and scheduling backoff
  3. Exception during reconnect (line ~6243): adapter may or may not be fully initialized → dispose (guarded by if adapter is None: return)

The adapter = None initialization at the top of the try block (line 6154) is correct defensive programming — ensures the except-path cleanup variable is always bound even if _create_adapter raises before assignment.

The _dispose_failed_adapter helper correctly swallows exceptions from adapter.disconnect() at DEBUG level — a disposal failure should not mask the original reconnect error.

Also verified the api_server.py change: ResponseStore.close() in disconnect() releases the sqlite3 db + WAL file descriptors. Combined with the reconnect-loop disposal, this closes the fd leak path documented in the docstring (~12-26 hours to fd exhaustion on 300s backoff).

Clean fix, no issues found.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: APPROVE

This PR correctly fixes a real file descriptor leak in the gateway reconnect loop that causes OSError: [Errno 24] Too many open files after ~12-26 hours of uptime.

Three failure paths — all covered:

  1. Non-retryable error (line 6201-6202): await self._dispose_failed_adapter(adapter, platform) called before del self._failed_platforms[platform]
  2. Retryable failure (line 6210-6211): await self._dispose_failed_adapter(adapter, platform) called before backoff update
  3. Exception during connect (line 6245): await self._dispose_failed_adapter(adapter, platform) in the except block, guarded by adapter = None initialization at line 6154

_dispose_failed_adapter helper is robust:

  • Null-safe: returns immediately if adapter is None (covers the adapter-creation-None path at line 6156 which just uses continue before adapter is set)
  • Exception-safe: wraps adapter.disconnect() in try/except so cleanup errors never crash the reconnect watcher
  • Calls the proper adapter.disconnect() method which every platform adapter implements

APIServerAdapter.disconnect() improvement:

  • Now also closes self._response_store (sqlite3 connection), releasing 2 fds (db + WAL) that were previously leaked even on clean shutdown. This is the primary source of the leak mentioned in the root cause analysis.

Tests: The PR has no test changes (0 deletions), which is acceptable:

  • The existing test_platform_reconnect.py tests use StubAdapter.disconnect() which is a no-op async — these continue to pass with the dispose calls inserted
  • The fix is structurally simple: inserting a careful helper call in three well-defined locations
  • The _safe_adapter_disconnect existing helper (used in startup failure paths) proves the codebase already trusts this disconnect-on-failure pattern
  • A regression test that verifies disconnect() is called on failed reconnects would be nice-to-have but not a blocker given the simplicity and guard-rails of the helper

Nit (non-blocking): The _dispose_failed_adapter helper is quite similar to _safe_adapter_disconnect (lines 2222-2250). The main difference is _dispose_failed_adapter uses a timeout and logs at DEBUG vs the existing one which logs at WARNING on timeout. Given this is a hot reconnect loop, the lighter-weight approach makes sense — but future cleanup could consolidate them.

Verdict: APPROVE — clean, scoped, correctly targets all three failure paths with a robust dispose helper. Fixes a real production issue (fd exhaustion).

@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/gateway Gateway runner, session dispatch, delivery labels Jun 2, 2026
@teknium1

teknium1 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for this — your diagnosis was spot on. The fd leak from APIServerAdapter's ResponseStore sqlite3 connection not being closed when an unowned adapter is dropped from the reconnect retry queue is exactly the root cause.

We shipped a parallel fix on main in commit 4b06c98 ("fix(gateway): close ResponseStore + dispose unowned adapter on reconnect failure") that does the same thing your PR does:

  • Disposes the newly-created adapter in all three reconnect failure paths (non-retryable, retryable, and exception-during-connect) via _dispose_unused_adapter() before dropping it from _failed_platforms.
  • Closes self._response_store in APIServerAdapter.disconnect() so the db + WAL fds are released.

It references #37011 directly. Because that landed independently and #37011 is already closed, this PR is now redundant, so I'm closing it — but the credit for the correct diagnosis and approach is yours. Thanks for taking the time to write it up so clearly.

@teknium1 teknium1 closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: File descriptor leak in platform reconnect loop — adapter sqlite3 connections never closed on failed reconnect - gateway dead after ~12h uptime

5 participants