fix(gateway): close adapter resources on failed reconnect to prevent fd leak by uzunkuyruk · Pull Request #37053 · NousResearch/hermes-agent

uzunkuyruk · 2026-06-01T22:54:12Z

What does this PR do?

Fixes a file descriptor leak in the platform reconnect loop that causes the gateway to exhaust the OS fd limit (~2560) after 12–26 hours of uptime, making it completely non-functional.

Root cause: When a platform fails to connect and is queued for reconnect, the loop creates a fresh adapter on every retry. For APIServerAdapter, __init__ opens a sqlite3.connect() (db + WAL = 2 fds). On all three failure paths the new adapter was simply dropped — never disconnected, never cleaned up leaking 2 fds per attempt. At the 300s backoff cap that's ~12 leaked fds/hour. After ~26h the gateway hits the 2560 fd limit causing OSError: [Errno 24] Too many open files on every file operation, silently killing all platform connections.

Note: The initial connection path already calls _safe_adapter_disconnect() on failure this PR only fixes the reconnect loop which was missing cleanup.

Three-part fix:

_dispose_failed_adapter() new helper that calls adapter.disconnect() in a try/except so cleanup errors never crash the reconnect watcher
Called in all three failure paths in the reconnect loop: non-retryable error, retryable failure, and exception during connect
adapter = None initialized before the try block so exception-path cleanup never raises NameError
APIServerAdapter.disconnect() now also closes the ResponseStore sqlite3 connection previously only the aiohttp server was stopped, leaving the db connection open even on clean shutdown

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)

References

Fixes #37011

Checklist

I've read the Contributing Guide
My commit messages follow Conventional Commits
My PR contains only changes related to this fix

…fd leak When a platform fails to connect and is queued for reconnect, the loop creates a fresh adapter on every retry attempt. For APIServerAdapter, __init__ opens a sqlite3.connect() (db + WAL = 2 fds). On all three failure paths the new adapter was simply dropped — never disconnected, never cleaned up — leaking 2 fds per attempt. At the 300s backoff cap that is ~12 leaked fds/hour. On a 26-hour uptime this exhausts the 2560 fd limit, causing OSError: [Errno 24] Too many open files on every new file operation, silently killing all platform connections and making the gateway non-functional. Three-part fix: - Add _dispose_failed_adapter() helper that calls adapter.disconnect() in a try/except so cleanup errors never crash the reconnect watcher - Call it in all three failure paths in the reconnect loop: non-retryable error, retryable failure, and exception during connect - Fix APIServerAdapter.disconnect() to also close the ResponseStore sqlite3 connection — previously only the aiohttp server was stopped, leaving the db connection open even on clean shutdown Fixes NousResearch#37011

…tion cleanup If _create_adapter raises before assigning the local variable, the except block's _dispose_failed_adapter(adapter) call would raise NameError. Initialize adapter=None so the cleanup path is always safe.

liuhao1024 · 2026-06-01T23:10:20Z

Verified all three reconnect-failure paths in _platform_reconnect_watcher now call _dispose_failed_adapter:

Non-retryable fatal error (line ~6201): adapter created, fatal_error_message set → dispose before del self._failed_platforms[platform]
Retryable failure (line ~6210): adapter created, not fatal → dispose before updating runtime status and scheduling backoff
Exception during reconnect (line ~6243): adapter may or may not be fully initialized → dispose (guarded by if adapter is None: return)

The adapter = None initialization at the top of the try block (line 6154) is correct defensive programming — ensures the except-path cleanup variable is always bound even if _create_adapter raises before assignment.

The _dispose_failed_adapter helper correctly swallows exceptions from adapter.disconnect() at DEBUG level — a disposal failure should not mask the original reconnect error.

Also verified the api_server.py change: ResponseStore.close() in disconnect() releases the sqlite3 db + WAL file descriptors. Combined with the reconnect-loop disposal, this closes the fd leak path documented in the docstring (~12-26 hours to fd exhaustion on 300s backoff).

Clean fix, no issues found.

tonydwb

Review: APPROVE

This PR correctly fixes a real file descriptor leak in the gateway reconnect loop that causes OSError: [Errno 24] Too many open files after ~12-26 hours of uptime.

Three failure paths — all covered:

Non-retryable error (line 6201-6202): await self._dispose_failed_adapter(adapter, platform) called before del self._failed_platforms[platform]
Retryable failure (line 6210-6211): await self._dispose_failed_adapter(adapter, platform) called before backoff update
Exception during connect (line 6245): await self._dispose_failed_adapter(adapter, platform) in the except block, guarded by adapter = None initialization at line 6154

_dispose_failed_adapter helper is robust:

Null-safe: returns immediately if adapter is None (covers the adapter-creation-None path at line 6156 which just uses continue before adapter is set)
Exception-safe: wraps adapter.disconnect() in try/except so cleanup errors never crash the reconnect watcher
Calls the proper adapter.disconnect() method which every platform adapter implements

APIServerAdapter.disconnect() improvement:

Now also closes self._response_store (sqlite3 connection), releasing 2 fds (db + WAL) that were previously leaked even on clean shutdown. This is the primary source of the leak mentioned in the root cause analysis.

Tests: The PR has no test changes (0 deletions), which is acceptable:

The existing test_platform_reconnect.py tests use StubAdapter.disconnect() which is a no-op async — these continue to pass with the dispose calls inserted
The fix is structurally simple: inserting a careful helper call in three well-defined locations
The _safe_adapter_disconnect existing helper (used in startup failure paths) proves the codebase already trusts this disconnect-on-failure pattern
A regression test that verifies disconnect() is called on failed reconnects would be nice-to-have but not a blocker given the simplicity and guard-rails of the helper

Nit (non-blocking): The _dispose_failed_adapter helper is quite similar to _safe_adapter_disconnect (lines 2222-2250). The main difference is _dispose_failed_adapter uses a timeout and logs at DEBUG vs the existing one which logs at WARNING on timeout. Given this is a hot reconnect loop, the lighter-weight approach makes sense — but future cleanup could consolidate them.

Verdict: APPROVE — clean, scoped, correctly targets all three failure paths with a robust dispose helper. Fixes a real production issue (fd exhaustion).

teknium1 · 2026-06-04T12:20:05Z

Thanks for this — your diagnosis was spot on. The fd leak from APIServerAdapter's ResponseStore sqlite3 connection not being closed when an unowned adapter is dropped from the reconnect retry queue is exactly the root cause.

We shipped a parallel fix on main in commit 4b06c98 ("fix(gateway): close ResponseStore + dispose unowned adapter on reconnect failure") that does the same thing your PR does:

Disposes the newly-created adapter in all three reconnect failure paths (non-retryable, retryable, and exception-during-connect) via _dispose_unused_adapter() before dropping it from _failed_platforms.
Closes self._response_store in APIServerAdapter.disconnect() so the db + WAL fds are released.

It references #37011 directly. Because that landed independently and #37011 is already closed, this PR is now redundant, so I'm closing it — but the credit for the correct diagnosis and approach is yours. Thanks for taking the time to write it up so clearly.

uzunkuyruk added 2 commits June 2, 2026 01:43

tonydwb approved these changes Jun 1, 2026

View reviewed changes

alt-glitch mentioned this pull request Jun 2, 2026

fix(gateway): close ResponseStore DB connection on adapter disconnect #37110

Open

alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/gateway Gateway runner, session dispatch, delivery labels Jun 2, 2026

BingqingLyu mentioned this pull request Jun 3, 2026

[Feature]: Add cross-PR conflict detection to github-code-review skill #38054

Open

teknium1 closed this Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): close adapter resources on failed reconnect to prevent fd leak#37053

fix(gateway): close adapter resources on failed reconnect to prevent fd leak#37053
uzunkuyruk wants to merge 2 commits into
NousResearch:mainfrom
uzunkuyruk:fix/gateway-reconnect-fd-leak

uzunkuyruk commented Jun 1, 2026

Uh oh!

liuhao1024 commented Jun 1, 2026

Uh oh!

tonydwb left a comment

Uh oh!

teknium1 commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

uzunkuyruk commented Jun 1, 2026

What does this PR do?

Type of Change

References

Checklist

Uh oh!

liuhao1024 commented Jun 1, 2026

Uh oh!

tonydwb left a comment

Choose a reason for hiding this comment

Uh oh!

teknium1 commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants