Summary
The kanban dispatcher in the gateway opens a new SQLite connection on every tick via kanban_db.connect(), but the file descriptors are not released even after conn.close() (called in finally blocks). After ~14 hours of runtime, the gateway process accumulates ~500 open FDs for kanban.db and ~500 for kanban.db-wal, hitting the OS soft limit (1024) and causing cascading failures.
Symptoms
- Feishu/飞书: HTTPS connection to open.feishu.cn fails with [Errno 24] Too many open files
- Kanban dispatcher: sqlite3.OperationalError: unable to open database file at kanban_db.py:990
Root Cause
- kanban_db.connect() (line 990) calls sqlite3.connect() on every invocation; there is no connection pooling or reuse
- The dispatcher calls connect() on every tick (~60s) via _tick_once_for_board() and _ready_nonempty()
- Although conn.close() is called in finally blocks, the descriptors for kanban.db and kanban.db-wal appear to stay open after close when the database is in WAL mode
- Observed: 499 FDs for kanban.db + 498 for kanban.db-wal = 997 FDs (gateway PID 73, FD limit was 1024)
- The .db-wal FDs correspond almost 1:1 with the .db FDs (498 vs 499), suggesting each closed connection leaves both of its files open
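The FD counts above can be re-measured from inside a process with something like the sketch below. It is Linux-only (it reads /proc/self/fd), and both count_fds_for() and open_close_cycle() are hypothetical helper names, not part of Hermes. The second function mimics the open/close-per-tick pattern in isolation. A standalone run that ends with no leftover descriptors would suggest the leak depends on something specific to the gateway process rather than on sqlite3 alone.

```python
import os
import sqlite3


def count_fds_for(path: str) -> int:
    """Count this process's open descriptors that resolve to `path` (Linux only)."""
    target = os.path.realpath(path)
    count = 0
    for name in os.listdir("/proc/self/fd"):
        try:
            if os.readlink(f"/proc/self/fd/{name}") == target:
                count += 1
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return count


def open_close_cycle(db_path: str, ticks: int) -> int:
    """Open and close a WAL-mode connection `ticks` times.

    Returns the number of descriptors still open for the database file
    once the loop has finished.
    """
    for _ in range(ticks):
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        finally:
            conn.close()
    return count_fds_for(db_path)
```

Logging count_fds_for() for both kanban.db and kanban.db-wal once per tick in the gateway would show whether the count grows with every connect() or only on some code paths.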
Affected Code
- hermes_cli/kanban_db.py:961-1018: connect() creates a new connection on every call
- gateway/run.py:4890: _tick_once_for_board() opens a connection per tick
- gateway/run.py:4967: _ready_nonempty() opens a connection per tick
Suggested Fix
Options (from least to most invasive):
- Single persistent connection: cache one connection per board slug and reuse it across ticks, only reopening on error
- Explicit WAL checkpoint before close: call conn.execute("PRAGMA wal_checkpoint(TRUNCATE)") before close, which may let SQLite release the WAL FDs
- Investigate Python GC interaction: del conn after conn.close(), or gc.collect(); CPython may be deferring the SQLite finalizer
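The first option could look roughly like the sketch below. BoardConnectionCache is a hypothetical name, and the opener callable stands in for kanban_db.connect(), whose real signature is not shown in this report. The cache hands back the same connection for a given board slug and only reopens when a cheap probe query fails.

```python
import sqlite3
from typing import Callable, Dict


class BoardConnectionCache:
    """Keep one SQLite connection per board slug and reuse it across ticks."""

    def __init__(self, opener: Callable[[str], sqlite3.Connection]) -> None:
        # `opener` is a stand-in for kanban_db.connect(); an assumption, not the real API.
        self._opener = opener
        self._conns: Dict[str, sqlite3.Connection] = {}

    def get(self, slug: str) -> sqlite3.Connection:
        """Return the cached connection for `slug`, reopening it if it is unusable."""
        conn = self._conns.get(slug)
        if conn is not None:
            try:
                conn.execute("SELECT 1").fetchone()  # cheap liveness probe
                return conn
            except sqlite3.Error:
                # Closed or broken connection: drop it and fall through to reopen.
                self.discard(slug)
        conn = self._opener(slug)
        self._conns[slug] = conn
        return conn

    def discard(self, slug: str) -> None:
        """Close and forget the connection for `slug`, if there is one."""
        conn = self._conns.pop(slug, None)
        if conn is not None:
            try:
                conn.close()
            except sqlite3.Error:
                pass

    def close_all(self) -> None:
        """Close every cached connection, e.g. on gateway shutdown."""
        for slug in list(self._conns):
            self.discard(slug)
```

With this shape, _tick_once_for_board() and _ready_nonempty() would call cache.get(slug) instead of opening and closing their own connection. One caveat: sqlite3 connections are bound to the creating thread by default, so this only fits if all ticks run on the same thread or the opener passes check_same_thread=False and access is serialized.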
Workaround (applied on site)
Raised FD soft limit to 65536 via prlimit. This buys time but the leak will eventually hit the new limit too.
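For deployments where running prlimit from outside is awkward, the same stopgap can be applied in-process at startup with the standard resource module. This is a sketch of an alternative to what was done on site, not the command that was actually used. raise_nofile_soft_limit() is a hypothetical helper name.

```python
import resource


def raise_nofile_soft_limit(target: int = 65536) -> int:
    """Raise the RLIMIT_NOFILE soft limit toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # never lower an existing limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

Like the prlimit workaround, this only delays the failure; it does not address the leak itself.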
Environment
- WSL2 (Ubuntu)
- Python 3.14
- Hermes Agent commit cc94195ea