fix(kanban): close decomposer SQLite connections to stop fd leak #29525

Closed
abeperl wants to merge 1 commit into NousResearch:main from abeperl:fix/kanban-decompose-fd-leak

@abeperl commented May 20, 2026

Summary

`hermes_cli/kanban_decompose.py` opened SQLite connections with `with kb.connect() as conn:`. Python's `sqlite3` connection context manager commits or rolls back the transaction but does not close the connection, so the underlying file descriptors (main DB + WAL) are released only on garbage collection, and connections held by reference cycles linger.
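
The behavior is easy to confirm with the standard library alone. A minimal sketch, using an in-memory database rather than the project's `kb.connect()` helper (assumed here to return a plain `sqlite3.Connection`); the table name is made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The connection's context manager only scopes a transaction:
# commit on success, rollback on exception.
with conn:
    conn.execute("CREATE TABLE tasks (id TEXT)")
    conn.execute("INSERT INTO tasks VALUES ('t1')")

# The connection is still open and usable after the block exits.
rows_after_with = conn.execute("SELECT id FROM tasks").fetchall()

# Only an explicit close() releases it; later use raises ProgrammingError.
conn.close()
try:
    conn.execute("SELECT 1")
    raised_after_close = False
except sqlite3.ProgrammingError:
    raised_after_close = True
```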

`list_triage_ids()` is called on every gateway dispatcher tick, for every board (via the auto-decompose loop in `gateway/run.py`), so this leaked ~1 connection (2 fds) per tick per board.

Impact (observed in production)

On a long-running gateway this exhausted the process's open-file limit (default soft `RLIMIT_NOFILE` of 1024) in ~5 hours. Once the fd table was full:

  • the Slack Socket Mode client could no longer open sockets, and every reconnect failed with `ClientConnectorDNSError: Cannot connect to host slack.com:443 ssl:default [Invalid argument]` (EINVAL). The agent went silent on Slack while the process stayed "active";
  • the kanban DB itself started throwing `sqlite3.OperationalError: unable to open database file`.

Host DNS/TLS to Slack was fine throughout — it was purely fd starvation. /proc/<pid>/fd showed ~993 of 1024 fds held open against kanban.db / kanban.db-wal across boards.
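
For anyone wanting to reproduce the fd count, a rough sketch of the same `/proc/<pid>/fd` check in Python. It is Linux-only, and the helper name `count_fds_matching` is invented for illustration, not something in the repo:

```python
import os


def count_fds_matching(pid, needle):
    """Count open fds of `pid` whose target path contains `needle` (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the file, socket, or pipe it refers to.
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink().
            continue
        if needle in target:
            count += 1
    return count
```

Calling `count_fds_matching(gateway_pid, "kanban.db")` once per minute should show a steady climb on an affected process and a flat line on a patched one. Reading another user's process requires matching privileges.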

Fix

Wrap all four `kb.connect()` sites in `kanban_decompose.py` with `contextlib.closing()` so the connection is deterministically closed when the block exits. Verified on a live gateway: kanban fds dropped from ~993 to 0 and stay flat across ticks; Slack reconnects immediately.
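
A sketch of the pattern, not the actual diff: it uses `sqlite3.connect()` directly in place of `kb.connect()`, and the `tasks` table, its columns, and both function names are hypothetical.

```python
import contextlib
import sqlite3


def seed_tasks(db_path, tasks):
    """Write (id, status) rows; table name and columns are made up for the demo."""
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        # closing() does not commit, so keep sqlite3's own transaction
        # scope nested inside it for writes.
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS tasks (id TEXT, status TEXT)")
            conn.executemany("INSERT INTO tasks VALUES (?, ?)", tasks)


def list_triage_ids_sketch(db_path):
    """Hypothetical stand-in for list_triage_ids(); not the project's code."""
    # closing() calls conn.close() when the block exits, even on error.
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT id FROM tasks WHERE status = 'triage' ORDER BY id"
        ).fetchall()
    return [row[0] for row in rows]
```

One thing to watch when applying this to write paths: `contextlib.closing()` only closes, it does not commit, so call sites that relied on the implicit commit need the nested `with conn:` shown in `seed_tasks`.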

Note for maintainers

The same `with kb.connect() as conn:` pattern appears in other modules that are not on the gateway hot path (`hermes_cli/kanban_specify.py` has an identical `list_triage_ids`, and there are ~34 sites in `hermes_cli/kanban.py`). They're latent leaks rather than active ones, so I've kept this PR scoped to the proven culprit. Happy to follow up with a sweep of the rest if you'd prefer them fixed in one go.

Test plan

  • CI green
  • Manual: ran a live gateway with the patch; `ls -l /proc/<pid>/fd | grep -c kanban` stays at 0 across many dispatcher ticks (previously climbed ~1/tick/board to the 1024 ceiling)
  • Manual: Slack Socket Mode reconnects and stays connected (ss -tnp shows an ESTABLISHED websocket to wss-primary.slack.com)

list_triage_ids() runs every gateway dispatcher tick per board and used
'with kb.connect() as conn:'. Python's sqlite3 connection context manager
commits/rolls back the transaction but does NOT close the connection, so
each tick leaked a connection (db+wal = 2 fds). Over ~5h this exhausted
the 1024 fd soft limit, starving the Slack websocket client of sockets
(ClientConnectorDNSError / EINVAL) and the kanban DB of file handles.

Wrap all four kb.connect() sites in contextlib.closing().

@alt-glitch added labels on May 20, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/cli (CLI entry point, hermes_cli/, setup wizard)

@alt-glitch (Collaborator)

Related: #28802 (same class of SQLite connection leak in kanban_specify helpers), #28803 (companion fix for specify path). This PR targets the decomposer hot path specifically — distinct code path from the specify helpers but same root cause pattern.

@kshitijk4poor (Collaborator)

Closing as already fixed on main — landed via commit ebe04c66c (fix(kanban): close kanban.db FD after every connect() in long-lived processes), which introduced the kb.connect_closing() context manager and converted all kanban_decompose.py connection sites to use it. Same fix, different idiom. Thanks for catching the leak.
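
For readers comparing the two idioms: the implementation of `kb.connect_closing()` is not shown in this thread, so the following is only a guess at the general shape of such a helper, written against plain `sqlite3`. The name `connect_closing_sketch` and its signature are assumptions.

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def connect_closing_sketch(db_path):
    """Yield a connection, commit or roll back on exit, then always close it."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite3's own context manager handles commit/rollback.
        with conn:
            yield conn
    finally:
        # Runs on success and on error, releasing the fds deterministically.
        conn.close()
```

Compared with wrapping each call site in `contextlib.closing()`, a single helper like this keeps the commit/rollback behavior that existing `with kb.connect() as conn:` call sites expect, so converting them is a one-word change.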

