fix(gateway): close kanban DB connection after dispatch tick #33113

Closed
cookily wants to merge 1 commit into NousResearch:main from cookily:fix/kanban-fd-leak

@cookily commented May 27, 2026
Problem

The kanban dispatcher's _tick_once_for_board() in gateway/run.py opens a SQLite connection via _kb.connect() on every 60-second dispatch tick, but never closes it. Over time (hours to days), this accumulates 50+ concurrent file descriptors to kanban.db, causing:

  1. SQLite disk I/O error when too many connections contend for file locks
  2. Eventual database corruption (database disk image is malformed)

Root Cause

# Before (leaks one FD per tick):
conn = _kb.connect(board=slug)
return _kb.dispatch_once(conn, ...)
# Finally block has conn.close() but exception paths skip it

The same function in kanban_db.py line 5475 already uses the correct pattern:

with contextlib.closing(connect()) as conn:
    res = dispatch_once(conn, ...)

Fix

Wrap the connection in contextlib.closing() so it is reliably closed on both normal return and exception paths. Also removes the now-redundant finally: conn.close() block.
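
A minimal, self-contained sketch of the pattern described above, not the actual gateway code. The connect() and dispatch_once() functions here are hypothetical stand-ins for _kb.connect() and _kb.dispatch_once(), whose real signatures are not shown in this PR. Note that a sqlite3 connection used directly as a context manager only commits or rolls back the transaction and does not close the connection, which is why contextlib.closing() is needed.

```python
import contextlib
import sqlite3

# Demo-only record of connections, so their state can be inspected afterwards.
seen = []


def connect(path=":memory:"):
    # Hypothetical stand-in for _kb.connect().
    return sqlite3.connect(path)


def dispatch_once(conn, fail=False):
    # Hypothetical stand-in for _kb.dispatch_once(); runs a trivial query.
    if fail:
        raise RuntimeError("simulated dispatch failure")
    return conn.execute("SELECT 1").fetchone()[0]


def tick_once(fail=False):
    # closing() calls conn.close() when the block exits,
    # whether by normal return or by a propagating exception.
    with contextlib.closing(connect()) as conn:
        seen.append(conn)
        return dispatch_once(conn, fail=fail)


def is_closed(conn):
    # sqlite3 raises ProgrammingError when a closed connection is used.
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        return True
    return False
```

Calling tick_once() and then tick_once(fail=True) should leave every connection in seen closed, including the one whose dispatch raised.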

Changes: 1 file, +16/-21 lines

Verification

Reproduced in production (Hermes gateway running 2+ hours):

  • Before fix: 56 kanban.db FDs → dispatch errors every 60 seconds → DB corruption within hours
  • After fix: 9 kanban.db FDs stable → zero dispatch errors after 12+ hours
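
The PR does not say how the FD counts above were collected. One possible way to take such a measurement from inside a Linux process is sketched below; the helper name count_fds_for is hypothetical, and the approach assumes /proc/self/fd is available (Linux only).

```python
import os


def count_fds_for(path, fd_dir="/proc/self/fd"):
    """Count this process's open descriptors that point at `path` (Linux only)."""
    target = os.path.realpath(path)
    count = 0
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to whatever the descriptor refers to.
            if os.readlink(os.path.join(fd_dir, name)) == target:
                count += 1
        except OSError:
            # The descriptor was closed while the directory was being scanned.
            continue
    return count
```

Logging this count once per tick would show whether the number of kanban.db descriptors keeps growing or stays flat. For another process, the same information lives under /proc/<pid>/fd.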

The kanban dispatcher's _tick_once_for_board() opens a SQLite connection
via _kb.connect() but never closes it, leaking one file descriptor per
60-second tick. Over time this accumulates 50+ concurrent connections,
causing SQLite 'disk I/O error' and eventual database corruption.

Fix: wrap the connection in contextlib.closing() so it is reliably
closed on both normal return and exception paths. Also removes the
now-redundant finally: conn.close() block.

This mirrors the same pattern already used in kanban_db.py's dispatch
loop (line 5475).
@alt-glitch added the labels type/bug (Something isn't working), P3 (Low; cosmetic, nice to have), comp/gateway (Gateway runner, session dispatch, delivery), and comp/plugins (Plugin system and bundled plugins) on May 27, 2026
@alt-glitch

Collaborator

Related to #29610 (root issue for kanban dispatcher SQLite connection/FD leak). This PR applies contextlib.closing() specifically to _tick_once_for_board() in gateway/run.py. See also: #31736 (dup of #29610), #31768 (broader dispatcher fix), #32226 (shared per-board connection cache), #29525 (decompose path leak).

@kshitijk4poor

Collaborator

Closing as already fixed on main. The dispatcher tick now closes the connection via a try/finally conn.close() block inside _tick_once_for_board() in gateway/run.py (lines 5538-5543). That's slightly more robust than contextlib.closing() because it handles the conn is None case explicitly (when _kb.connect() raises before assignment). Thanks for catching the leak.
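
For comparison, a rough sketch of the try/finally shape described in this comment. This is not the code at the cited lines of gateway/run.py; connect() is a hypothetical stand-in for _kb.connect(), and the fail and opened parameters exist only to make the demo observable.

```python
import sqlite3


def connect(path=":memory:", fail=False):
    # Hypothetical stand-in for _kb.connect(); `fail` simulates connect() raising.
    if fail:
        raise sqlite3.OperationalError("simulated connect failure")
    return sqlite3.connect(path)


def tick_once(fail_connect=False, opened=None):
    conn = None  # bound before the try so the finally block can always test it
    try:
        conn = connect(fail=fail_connect)
        if opened is not None:
            opened.append(conn)  # demo-only hook for inspecting the connection later
        return conn.execute("SELECT 1").fetchone()[0]
    finally:
        # Runs on normal return and on any exception.
        # Skip close() if connect() raised before conn was assigned.
        if conn is not None:
            conn.close()
```

If connect() raises, the finally block sees conn is None and the original exception propagates unchanged rather than being masked by an error from the cleanup code.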

3 participants