kanban dispatcher FD leak: SQLite connections not releasing file descriptors in WAL mode #30799

@siysun

Summary

The kanban dispatcher in the gateway opens a new SQLite connection on every tick via kanban_db.connect(), but the file descriptors are not released even after conn.close() (called in finally blocks). After ~14 hours of runtime, the gateway process accumulates ~500 open FDs for kanban.db and ~500 for kanban.db-wal, hitting the OS soft limit (1024) and causing cascading failures.

Symptoms

  • Feishu/飞书: HTTPS connection to open.feishu.cn fails with [Errno 24] Too many open files
  • Kanban dispatcher: sqlite3.OperationalError: unable to open database file at kanban_db.py:990

Root Cause

  • kanban_db.connect() (line 990) calls sqlite3.connect() on every invocation, with no connection pooling or reuse
  • The dispatcher calls connect() on every tick (~60s) via _tick_once_for_board() and _ready_nonempty()
  • Although conn.close() is called in finally blocks, SQLite WAL mode appears to keep the WAL file descriptor open even after close
  • Observed: 499 FDs for kanban.db + 498 for kanban.db-wal = 997 FDs (gateway PID 73, FD limit was 1024)
  • The .db-wal FDs correspond 1:1 with .db FDs, suggesting each WAL connection holds a file open after close
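To check whether FDs really outlive conn.close(), the counts above can be measured from inside the process. The following is a minimal sketch for Linux/WSL2 that reads /proc/self/fd; count_open_fds and open_close_cycles are hypothetical helper names, not part of kanban_db.py, and the table used in the loop is only a stand-in for real dispatcher queries.

```python
import os
import sqlite3
import tempfile


def count_open_fds(path: str) -> int:
    """Count this process's open FDs that point at exactly `path` (Linux only)."""
    fd_dir = "/proc/self/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the FD was closed between listdir() and readlink()
        if target == path:
            count += 1
    return count


def open_close_cycles(db_path: str, cycles: int = 50) -> tuple[int, int]:
    """Open and close `cycles` WAL-mode connections, then report remaining FDs.

    Returns (FDs open on the database file, FDs open on its -wal file).
    """
    db_path = os.path.realpath(db_path)  # /proc reports resolved paths
    for _ in range(cycles):
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("CREATE TABLE IF NOT EXISTS probe (x INTEGER)")
            conn.execute("INSERT INTO probe VALUES (1)")
            conn.commit()
        finally:
            conn.close()
    return count_open_fds(db_path), count_open_fds(db_path + "-wal")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(open_close_cycles(os.path.join(tmp, "probe.db")))
```

Running the same loop while a second, long-lived connection to the same file is held open elsewhere in the process would help narrow down whether the leak depends on another connection being present.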

Affected Code

  • hermes_cli/kanban_db.py:961-1018: connect() creates a new connection on every call
  • gateway/run.py:4890: _tick_once_for_board() opens a connection per tick
  • gateway/run.py:4967: _ready_nonempty() opens a connection per tick

Suggested Fix

Options (from least to most invasive):

  1. Single persistent connection: cache one connection per board slug and reuse it across ticks, only reopening on error
  2. Explicit WAL checkpoint before close: call conn.execute("PRAGMA wal_checkpoint(TRUNCATE)") before close to force SQLite to release WAL FDs
  3. Investigate Python GC interaction: del conn after conn.close(), or gc.collect(); CPython may be deferring the SQLite finalizer
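Options 1 and 2 could look roughly like the sketch below. It is not the Hermes implementation: BoardConnectionCache, run_tick, close_with_checkpoint, and the path_for_slug callable are hypothetical names introduced for illustration, and the sketch assumes all ticks run on the thread that created the connection (sqlite3 connections are thread-bound by default). Whether the checkpoint actually releases the WAL FDs is untested here.

```python
import sqlite3
from typing import Callable


class BoardConnectionCache:
    """Keep one SQLite connection per board slug and reuse it across ticks."""

    def __init__(self, path_for_slug: Callable[[str], str]):
        # Hypothetical: maps a board slug to its kanban.db path.
        self._path_for_slug = path_for_slug
        self._conns: dict[str, sqlite3.Connection] = {}

    def get(self, slug: str) -> sqlite3.Connection:
        """Return the cached connection for `slug`, opening it on first use."""
        conn = self._conns.get(slug)
        if conn is None:
            conn = sqlite3.connect(self._path_for_slug(slug))
            self._conns[slug] = conn
        return conn

    def invalidate(self, slug: str) -> None:
        """Drop and close the cached connection so the next get() reopens it."""
        conn = self._conns.pop(slug, None)
        if conn is not None:
            conn.close()

    def close_all(self) -> None:
        for slug in list(self._conns):
            self.invalidate(slug)


def run_tick(cache: BoardConnectionCache, slug: str):
    """One dispatcher tick: reuse the connection, reopen only after an error."""
    try:
        return cache.get(slug).execute("SELECT 1").fetchone()
    except sqlite3.Error:
        cache.invalidate(slug)
        raise


def close_with_checkpoint(conn: sqlite3.Connection) -> None:
    """Option 2: checkpoint and truncate the WAL, then close."""
    try:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    finally:
        conn.close()
```

With the cache, the number of open FDs is bounded by the number of boards rather than by uptime, so option 1 sidesteps the leak even if its underlying cause is never identified.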

Workaround (applied on site)

Raised FD soft limit to 65536 via prlimit. This buys time but the leak will eventually hit the new limit too.
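The same workaround can be applied from inside the process at startup instead of externally via prlimit. This is a sketch using the standard-library resource module (Unix only); raise_nofile_soft_limit is a hypothetical helper name, and the soft limit can only be raised up to the existing hard limit without extra privileges.

```python
import resource


def raise_nofile_soft_limit(target: int) -> tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE towards `target`, capped at the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # cannot exceed the hard limit
    # Only ever raise the limit; leave an unlimited or higher soft limit alone.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_nofile_soft_limit(65536))
```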

Environment

  • WSL2 (Ubuntu)
  • Python 3.14
  • Hermes Agent commit cc94195ea

Labels

  • P3 (Low: cosmetic, nice to have)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • comp/plugins (Plugin system and bundled plugins)
  • type/bug (Something isn't working)