Gateway embedded Kanban dispatcher opens SQLite WAL connections every tick, causing FD/WAL pressure #31736

@faisfamilytravel

Draft upstream issue — Kanban dispatcher persistent connection / WAL FD pressure

Title: Gateway embedded Kanban dispatcher opens SQLite WAL connections every tick, causing FD/WAL pressure

Summary

The gateway embedded Kanban dispatcher currently opens and closes Kanban SQLite connections on every dispatcher tick. The dispatch path opens one connection per board, and the health telemetry path opens another connection per board on the same tick. In a long-running gateway process with a short dispatch interval, this creates repeated SQLite WAL/SHM connection churn and file descriptor pressure.

We observed this locally in a long-running Mission Control gateway process: lsof showed multiple open handles for kanban.db, kanban.db-wal, and kanban.db-shm. A prior local patch (DEC-2026-05-23-024) mitigated the close path by using a _WalSafeConnection that runs PRAGMA wal_checkpoint(TRUNCATE) before close, but that does not remove the underlying dispatcher churn pattern.
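The lsof observation above can also be approximated from inside the process. The following is a minimal sketch, assuming a Linux host where /proc/self/fd is available; the helper name open_db_handles is hypothetical and is not part of the gateway code.

```python
import os


def open_db_handles(basename="kanban.db"):
    """List targets of this process's open FDs whose file name starts with `basename`.

    Assumption: Linux-style /proc/self/fd. On other platforms this returns [].
    Matches kanban.db, kanban.db-wal and kanban.db-shm.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return []
    handles = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if os.path.basename(target).startswith(basename):
            handles.append(target)
    return sorted(handles)
```

Logging len(open_db_handles()) once per tick would show whether the handle count grows over time or stays flat.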

Affected area

  • gateway/run.py
  • Embedded _kanban_dispatcher_watcher()
  • Kanban DB dispatch path and dispatcher health probe

Current behavior

Per tick:

  1. _tick_once_for_board() opens _kb.connect(board=slug), calls _kb.dispatch_once(...), then closes the connection.
  2. _ready_nonempty() opens another _kb.connect(board=slug) for health telemetry, checks spawnable ready/review tasks, then closes the connection.
  3. The watcher uses asyncio.to_thread(...), so work may run on arbitrary default executor threads across ticks.

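This keeps blocking DB work off the event loop, but it is unfavorable for persistent SQLite connection reuse: Python sqlite3 connections are thread-affine by default (check_same_thread=True), so a connection opened on one executor thread cannot be reused when a later tick lands on a different thread.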
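The thread-affinity constraint can be reproduced with the standard library alone. This is an illustrative sketch, not gateway code; the helper name use_from_other_thread is hypothetical.

```python
import sqlite3
import threading


def use_from_other_thread(conn: sqlite3.Connection) -> str:
    """Try to use `conn` from a freshly started thread and report the outcome."""
    outcome = []

    def worker():
        try:
            conn.execute("SELECT 1")
            outcome.append("ok")
        except sqlite3.ProgrammingError:
            # Raised because the connection was created on a different thread.
            outcome.append("ProgrammingError")

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return outcome[0]


conn = sqlite3.connect(":memory:")  # default: check_same_thread=True
result = use_from_other_thread(conn)
conn.close()
```

Here result is "ProgrammingError", which is why a cached connection only works if every tick runs on the same thread.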

Expected behavior

The embedded dispatcher should avoid per-tick SQLite WAL connection churn while keeping DB work off the event loop and preserving sqlite thread affinity.

Proposed fix

Use a dedicated single-thread ThreadPoolExecutor for dispatcher DB work and maintain a per-board persistent SQLite connection cache inside the dispatcher watcher:

  • one executor thread named kanban-dispatcher,
  • one cached connection per active board,
  • dispatch and ready/review health probes share the cached board connection,
  • fingerprint changes close and reopen the cached connection,
  • corrupt-board handling closes/discards cached connection and suppresses retry until DB fingerprint changes,
  • watcher shutdown/cancellation closes all cached connections on the dispatcher executor thread.
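The shape of the proposed fix can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual patch: BoardConnectionCache, tick_board, and run_ticks are hypothetical names, the connect and fingerprint callables stand in for the real Kanban DB helpers, and corrupt-board suppression is omitted.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class BoardConnectionCache:
    """Per-board connection cache. All methods must be called on one thread."""

    def __init__(self, connect, fingerprint):
        self._connect = connect          # hypothetical: slug -> sqlite3.Connection
        self._fingerprint = fingerprint  # hypothetical: slug -> hashable value
        self._conns = {}                 # slug -> (fingerprint, connection)

    def get(self, slug):
        fp = self._fingerprint(slug)
        cached = self._conns.get(slug)
        if cached is not None and cached[0] == fp:
            return cached[1]             # reuse across ticks
        if cached is not None:
            cached[1].close()            # fingerprint changed: drop stale handle
        conn = self._connect(slug)
        self._conns[slug] = (fp, conn)
        return conn

    def discard(self, slug):
        cached = self._conns.pop(slug, None)
        if cached is not None:
            cached[1].close()

    def close_all(self):
        for slug in list(self._conns):
            self.discard(slug)


def tick_board(cache, slug):
    """Stand-in for dispatch plus health probe sharing one cached connection."""
    conn = cache.get(slug)
    conn.execute("SELECT 1").fetchone()
    return id(conn), threading.current_thread().name


async def run_ticks(cache, slugs, ticks):
    # One worker, so every DB call lands on the same thread.
    # The thread is named "kanban-dispatcher_0" by the prefix mechanism.
    executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="kanban-dispatcher")
    loop = asyncio.get_running_loop()
    seen = []
    try:
        for _ in range(ticks):
            for slug in slugs:
                seen.append(await loop.run_in_executor(executor, tick_board, cache, slug))
    finally:
        # Close cached connections on the thread that opened them.
        await loop.run_in_executor(executor, cache.close_all)
        executor.shutdown(wait=True)
    return seen
```

Because the executor has exactly one worker, the default sqlite3 thread check never trips, and the number of connections opened is bounded by the number of active boards rather than by the number of ticks.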

This is upstreamable because it is a minimal runtime change and does not add deployment-specific assumptions.

Local validation

Focused tests added locally:

  • tests/gateway/test_kanban_dispatcher.py::test_kanban_dispatcher_uses_dedicated_single_thread_executor
  • tests/gateway/test_kanban_dispatcher.py::test_kanban_dispatcher_reuses_board_connection_across_ticks
  • tests/gateway/test_kanban_dispatcher.py::test_kanban_dispatcher_health_probe_uses_cached_connection
  • tests/gateway/test_kanban_dispatcher.py::test_kanban_dispatcher_closes_cached_connection_on_shutdown
  • tests/gateway/test_kanban_dispatcher.py::test_kanban_dispatcher_reopens_cached_connection_when_fingerprint_changes
  • tests/gateway/test_kanban_dispatcher.py::test_kanban_dispatcher_corrupt_board_closes_and_suppresses_until_fingerprint_changes

Command:

venv/bin/python -m pytest tests/gateway/test_kanban_dispatcher.py tests/hermes_cli/test_kanban_db.py -q

Result:

178 passed in 4.59s

Related local evidence

Local deviation DEC-2026-05-23-024 previously addressed the close-path symptom with _WalSafeConnection.close() running PRAGMA wal_checkpoint(TRUNCATE) before super().close(). This issue is the underlying dispatcher lifecycle problem: repeated per-tick open/close cycles. The persistent dispatcher connection refactor reduces dependence on the close-path mitigation but does not replace the need for safe close behavior in the public kanban_db.connect() API.
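For context, the close-path mitigation described above might look roughly like this. It is a sketch reconstructed from the description only, not the local patch itself; the connect_wal_safe helper is hypothetical and is not the kanban_db.connect() API.

```python
import sqlite3


class _WalSafeConnection(sqlite3.Connection):
    """Connection that checkpoints and truncates the WAL before closing."""

    def close(self):
        try:
            # Best effort: fold WAL contents into the main DB and truncate the WAL.
            self.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
        except sqlite3.Error:
            # Already closed, locked, or mid-transaction: fall through to close.
            pass
        super().close()


def connect_wal_safe(path):
    """Hypothetical helper: open `path` in WAL mode with the safe-close class."""
    conn = sqlite3.connect(path, factory=_WalSafeConnection)
    conn.execute("PRAGMA journal_mode=WAL").fetchall()
    return conn
```

The checkpoint matters most when other connections to the same database are still open, since in that case closing one connection does not by itself clean up the WAL file.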

Metadata

Labels

  • P3: Low (cosmetic, nice to have)
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/perf: Performance improvement or optimization
