Skip to content

fix(kanban): close kanban.db FD after every connect() in long-lived processes#33564

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-6bc90445
May 28, 2026
Merged

fix(kanban): close kanban.db FD after every connect() in long-lived processes#33564
teknium1 merged 1 commit into
mainfrom
hermes/hermes-6bc90445

Conversation

@teknium1

@teknium1 teknium1 commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes the kanban.db FD leak by adding a real closing context manager and switching every leaky call site in long-lived processes to use it. Closes #33159.

sqlite3.Connection.__exit__ commits/rollbacks but does NOT close the underlying file descriptor. with kb.connect() as conn: in long-lived processes (gateway run_slash, dashboard decompose_task_endpoint) leaks one FD per call. The reporter's gateway died after ~4 days with [Errno 24] Too many open files.

Changes

  • hermes_cli/kanban_db.py: add connect_closing() — a real @contextlib.contextmanager that wraps connect() with try/finally: conn.close().
  • hermes_cli/kanban.py (35 sites), hermes_cli/kanban_decompose.py (4), hermes_cli/kanban_specify.py (3): switch with kb.connect(...) as conn:with kb.connect_closing(...) as conn:.
  • tests/hermes_cli/test_kanban_db.py: regression coverage — happy-path closure, exception-path closure, smoke test, plus an explicit "bare with kb.connect() still leaks" assertion documenting the upstream sqlite3 behaviour we're working around.

kanban.py is the most impactful because run_slash (called from the gateway for every /kanban slash command) dispatches argparse to those _cmd_* functions in-process — each was leaking one FD per invocation, accumulating until the gateway hit RLIMIT_NOFILE.

Test code under tests/ is intentionally untouched: short-lived processes where OS cleanup masks the leak.

Validation

Surface Result
tests/hermes_cli/test_kanban_db.py 198/198 pass (4 new regression tests)
Kanban suite (15 files) 728/729 pass — 1 pre-existing failure (test_worker_complete_rejects_stale_run_id) reproduces unchanged on origin/main, unrelated to this PR (caused by c002668 add grace period to detect_crashed_workers)

Closes

Credit

Reporter @kenshinsee correctly identified the leak shape. The specific handler names cited in the report (get_home_channels, decompose_task_endpoint) were not the actual leak sites on current main — get_home_channels already has try/finally: conn.close(), and decompose_task_endpoint doesn't open conn directly. The real leak surface is in kanban.py's _cmd_* handlers (reached via the gateway's /kanban slash dispatch) and the decompose_task / specify modules they call into. Same root cause as reported, wider blast radius than reported.

Infographic

pr-33564-kanban-fd-leak-fix

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: hermes/hermes-6bc90445 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9519 on HEAD, 9519 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5015 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

teknium1 added a commit that referenced this pull request May 28, 2026
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   c002668 ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   ffdc937 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on #33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.
…rocesses

`sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the
underlying FD. `with kb.connect() as conn:` in long-lived processes
(gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore
leaks one FD to `kanban.db` per call. After enough operations the
gateway dies with `[Errno 24] Too many open files` (~4 days uptime
in the production report — #33159).

Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db`
that wraps `connect()` with a real `try/finally: conn.close()`. Switch
the 42 leak-prone call sites in `hermes_cli/kanban.py` (35),
`hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py`
(3) over to it.

`kanban.py` matters because `run_slash` (called from the gateway for
every `/kanban` slash command) parses argparse and dispatches to those
`_cmd_*` functions in-process — each one was leaking one FD per
invocation.

Tests inside `tests/` are untouched: short-lived processes where OS
cleanup masks the leak. Regression tests added in
`test_kanban_db.py` cover both happy-path and exception-path closure,
plus an explicit assertion that bare `with kb.connect()` still does
NOT close (documenting the upstream sqlite3 behaviour we're working
around).

Closes #33159.
@teknium1 teknium1 force-pushed the hermes/hermes-6bc90445 branch from b7ae6e2 to 0b9cf94 Compare May 28, 2026 01:27
@alt-glitch alt-glitch added type/bug Something isn't working comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have labels May 28, 2026
@teknium1 teknium1 merged commit ebe04c6 into main May 28, 2026
25 checks passed
@teknium1 teknium1 deleted the hermes/hermes-6bc90445 branch May 28, 2026 05:07
mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   c002668 ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   ffdc937 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on NousResearch#33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   c002668 ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   ffdc937 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on NousResearch#33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.

#AI commit#
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   c002668 ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   ffdc937 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on NousResearch#33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   c002668 ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   ffdc937 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on NousResearch#33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   c002668 ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   ffdc937 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on NousResearch#33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
Two pre-existing test failures on main, both pointing at code that
was hardened recently — not behaviour bugs, test expectations that
fell out of date.

1. tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id
   196b1fe ("fix(kanban): add grace period to detect_crashed_workers")
   gates each running task behind a launch-window grace period so
   freshly-spawned workers whose PID isn't yet visible on /proc don't
   get reclaimed. The test creates a worker_env fixture moments before
   asserting reclamation, so the default 30s grace skips the liveness
   check and detect_crashed_workers returns []. Fix: set
   HERMES_KANBAN_CRASH_GRACE_SECONDS=0 in the test so we get the
   immediate-reclaim semantics the assertion expects.

2. tests/tools/test_windows_native_support.py::
     TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop
   1d0be33 ("fix(kanban): hoist zombie reaper out of dispatch_once")
   reshaped reap_worker_zombies to use an early-return Windows guard
   (\`if os.name == "nt": return []\`) instead of an inverted gate
   (\`if os.name != "nt":\`). Both correctly keep the waitpid loop off
   Windows — the early-return form is stronger because the rest of the
   function never runs. Fix: accept either gate pattern in the source
   scan.

Both failures reproduce verbatim on \`origin/main\` in a clean env;
neither relates to in-flight work on #33564 (the FD-leak fix). Filing
this as a separate fix-it PR per green-CI-policy so the kanban CI
shard stays green for downstream PRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Kanban plugin: kanban.db file descriptor leak — gateway crashes after ~4 days

2 participants