fix(kanban): close decomposer SQLite connections to stop fd leak#29525
Closed
abeperl wants to merge 1 commit into
list_triage_ids() runs every gateway dispatcher tick per board and used 'with kb.connect() as conn:'. Python's sqlite3 connection context manager commits/rolls back the transaction but does NOT close the connection, so each tick leaked a connection (db+wal = 2 fds). Over ~5h this exhausted the 1024 fd soft limit, starving the Slack websocket client of sockets (ClientConnectorDNSError / EINVAL) and the kanban DB of file handles. Wrap all four kb.connect() sites in contextlib.closing().
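The behavior described in the commit message can be reproduced with the standard library alone. The following is a minimal sketch, not the project's code: `connect()` is a hypothetical stand-in for `kb.connect()` and uses an in-memory database so the example is self-contained, and `is_open()` is an illustrative helper.

```python
import contextlib
import sqlite3


def connect() -> sqlite3.Connection:
    # Hypothetical stand-in for kb.connect(); an in-memory DB keeps the sketch self-contained.
    return sqlite3.connect(":memory:")


def is_open(conn: sqlite3.Connection) -> bool:
    """Return True if the connection still accepts statements."""
    try:
        conn.execute("SELECT 1")
        return True
    except sqlite3.ProgrammingError:
        # sqlite3 raises ProgrammingError when operating on a closed connection.
        return False


# Leaky pattern: the connection's context manager only commits or rolls back.
with connect() as leaky:
    leaky.execute("CREATE TABLE t (id INTEGER)")
leaky_still_open = is_open(leaky)  # the connection outlives the with block
leaky.close()  # without this, the handle lingers until garbage collection

# Fixed pattern: contextlib.closing() calls conn.close() when the block exits.
with contextlib.closing(connect()) as fixed:
    fixed.execute("CREATE TABLE t (id INTEGER)")
fixed_still_open = is_open(fixed)
```

One design point to keep in mind: `contextlib.closing()` alone does not commit. For a site that writes, nesting the connection's own context manager inside it (`with contextlib.closing(connect()) as conn, conn:`) keeps the commit/rollback behavior and still closes the connection afterwards.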
This was referenced May 26, 2026
Closing as already fixed on main; landed via commit
Summary

`hermes_cli/kanban_decompose.py` opened SQLite connections with `with kb.connect() as conn:`. Python's `sqlite3` connection context manager commits or rolls back the transaction but does not close the connection, so the underlying file descriptors (main DB + WAL) are released only on garbage collection, and connections held by reference cycles linger. `list_triage_ids()` is called on every gateway dispatcher tick, for every board (via the auto-decompose loop in `gateway/run.py`), so this leaked ~1 connection (2 fds) per tick per board.

Impact (observed in production)

On a long-running gateway this exhausted the process's open-file limit (default soft `RLIMIT_NOFILE` of 1024) in ~5 hours. Once the fd table was full:

- `ClientConnectorDNSError: Cannot connect to host slack.com:443 ssl:default [Invalid argument]` (EINVAL): the agent went silent on Slack while the process stayed "active";
- `sqlite3.OperationalError: unable to open database file`.

Host DNS/TLS to Slack was fine throughout; the failures were purely fd starvation. `/proc/<pid>/fd` showed ~993 of 1024 fds held open against `kanban.db` / `kanban.db-wal` across boards.

Fix

Wrap all four `kb.connect()` sites in `kanban_decompose.py` with `contextlib.closing()` so the connection is deterministically closed when the block exits. Verified on a live gateway: kanban fds dropped from ~993 to 0 and stay flat across ticks; Slack reconnects immediately.

Note for maintainers

The same `with kb.connect() as conn:` pattern appears in other modules that are not on the gateway hot path (`hermes_cli/kanban_specify.py` has an identical `list_triage_ids`, and there are ~34 sites in `hermes_cli/kanban.py`). They're latent leaks rather than active ones, so I've kept this PR scoped to the proven culprit. Happy to follow up with a sweep of the rest if you'd prefer them fixed in one go.

Test plan

- `ls -l /proc/<pid>/fd | grep -c kanban` stays at 0 across many dispatcher ticks (previously climbed ~1/tick/board to the 1024 ceiling)
- `ss -tnp` shows an ESTABLISHED websocket to `wss-primary.slack.com`
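The fd check in the test plan can also be scripted, which may be handy for watching the count over many ticks. This is a rough sketch under stated assumptions: it is Linux-only (it relies on `/proc/<pid>/fd` entries being symlinks to the open files), and both function names are hypothetical helpers, not part of the repository.

```python
import os


def count_fds_matching(fd_dir: str, needle: str) -> int:
    """Count entries in fd_dir whose symlink target contains needle."""
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd was closed between listdir() and readlink(),
            # or the entry is not a symlink; skip it either way.
            continue
        if needle in target:
            count += 1
    return count


def kanban_fd_count(pid: int) -> int:
    """Linux-only: number of fds held by pid whose target mentions 'kanban'."""
    return count_fds_matching(f"/proc/{pid}/fd", "kanban")
```

Calling `kanban_fd_count(<pid>)` for the gateway process on each tick should report a flat value once the connections are closed deterministically, and a steadily climbing one if the leak is still present.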