fix(kanban): close kanban.db FD after every connect() in long-lived processes by teknium1 · Pull Request #33564 · NousResearch/hermes-agent

teknium1 · 2026-05-28T01:07:14Z

Summary

Closes the kanban.db FD leak by adding a real closing context manager and switching every leaky call site in long-lived processes to use it. Closes #33159.

`sqlite3.Connection.__exit__` commits or rolls back the transaction but does NOT close the underlying file descriptor. A `with kb.connect() as conn:` block in a long-lived process (gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore leaks one FD per call. The reporter's gateway died after ~4 days with `[Errno 24] Too many open files`.
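The behaviour described above can be checked with the standard library alone. The snippet below is a minimal illustration that uses an in-memory database instead of `kanban.db`; the variable names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The connection's context manager only scopes a transaction:
# commit on success, rollback on exception. It does not close anything.
with conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")

# The connection is still open after the with-block, so this query succeeds.
rows_after_with = conn.execute("SELECT x FROM t").fetchall()

# Only an explicit close() releases the handle.
conn.close()

# Any further use of a closed connection raises ProgrammingError.
try:
    conn.execute("SELECT 1")
    usable_after_close = True
except sqlite3.ProgrammingError:
    usable_after_close = False
```

In a short-lived script the open handle is reclaimed at process exit, which is why the leak only shows up in long-running processes.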

Changes

- `hermes_cli/kanban_db.py`: add `connect_closing()`, a real `@contextlib.contextmanager` that wraps `connect()` with `try/finally: conn.close()`.
- `hermes_cli/kanban.py` (35 sites), `hermes_cli/kanban_decompose.py` (4), `hermes_cli/kanban_specify.py` (3): switch `with kb.connect(...) as conn:` → `with kb.connect_closing(...) as conn:`.
- `tests/hermes_cli/test_kanban_db.py`: regression coverage: happy-path closure, exception-path closure, smoke test, plus an explicit "bare `with kb.connect()` still leaks" assertion documenting the upstream sqlite3 behaviour we're working around.
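A closing context manager of the kind described in the list above might look roughly like the sketch below. This is not the code from `hermes_cli/kanban_db.py`: the `connect()` stand-in, its `db_path` parameter, and the `is_closed()` helper are assumptions made so that the example runs on its own.

```python
import contextlib
import sqlite3
from typing import Iterator


def connect(db_path: str = ":memory:") -> sqlite3.Connection:
    """Stand-in for kb.connect(); the real signature is not shown in this PR text."""
    return sqlite3.connect(db_path)


@contextlib.contextmanager
def connect_closing(db_path: str = ":memory:") -> Iterator[sqlite3.Connection]:
    """Yield a connection and always close it, even if the body raises."""
    conn = connect(db_path)
    try:
        yield conn
    finally:
        conn.close()


def is_closed(conn: sqlite3.Connection) -> bool:
    """Hypothetical test helper: a closed connection rejects any statement."""
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        return True
    return False
```

Note that this sketch does not commit on exit, so callers would commit explicitly inside the block; whether the real helper also commits is not stated here. The standard library's `contextlib.closing(conn)` gives the same close-on-exit guarantee for an existing object.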

kanban.py is the most impactful because run_slash (called from the gateway for every /kanban slash command) dispatches argparse to those _cmd_* functions in-process — each was leaking one FD per invocation, accumulating until the gateway hit RLIMIT_NOFILE.

Test code under tests/ is intentionally untouched: tests run in short-lived processes, where OS cleanup at exit masks the leak.

Validation

| Surface | Result |
| --- | --- |
| `tests/hermes_cli/test_kanban_db.py` | 198/198 pass (4 new regression tests) |
| Kanban suite (15 files) | 728/729 pass; 1 pre-existing failure (`test_worker_complete_rejects_stale_run_id`) reproduces unchanged on `origin/main` and is unrelated to this PR (caused by `c002668` `add grace period to detect_crashed_workers`) |

Closes

[Bug] Kanban plugin: kanban.db file descriptor leak — gateway crashes after ~4 days #33159

Credit

Reporter @kenshinsee correctly identified the leak shape. The specific handler names cited in the report (get_home_channels, decompose_task_endpoint) were not the actual leak sites on current main — get_home_channels already has try/finally: conn.close(), and decompose_task_endpoint doesn't open conn directly. The real leak surface is in kanban.py's _cmd_* handlers (reached via the gateway's /kanban slash dispatch) and the decompose_task / specify modules they call into. Same root cause as reported, wider blast radius than reported.

github-actions · 2026-05-28T01:08:06Z

🔎 Lint report: `hermes/hermes-6bc90445` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9519 on HEAD, 9519 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5015 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Two pre-existing test failures on main, both pointing at code that was hardened recently. These are not behaviour bugs; the test expectations fell out of date.

1. `tests/tools/test_kanban_tools.py::test_worker_complete_rejects_stale_run_id`: c002668 ("fix(kanban): add grace period to detect_crashed_workers") gates each running task behind a launch-window grace period so freshly-spawned workers whose PID isn't yet visible on /proc don't get reclaimed. The test creates a `worker_env` fixture moments before asserting reclamation, so the default 30s grace skips the liveness check and `detect_crashed_workers` returns `[]`. Fix: set `HERMES_KANBAN_CRASH_GRACE_SECONDS=0` in the test so we get the immediate-reclaim semantics the assertion expects.
2. `tests/tools/test_windows_native_support.py::TestKanbanWaitpidWindowsGuard::test_source_gates_waitpid_loop`: ffdc937 ("fix(kanban): hoist zombie reaper out of dispatch_once") reshaped `reap_worker_zombies` to use an early-return Windows guard (`if os.name == "nt": return []`) instead of an inverted gate (`if os.name != "nt":`). Both correctly keep the waitpid loop off Windows; the early-return form is stronger because the rest of the function never runs. Fix: accept either gate pattern in the source scan.

Both failures reproduce verbatim on `origin/main` in a clean env; neither relates to in-flight work on #33564 (the FD-leak fix). Filing this as a separate fix-it PR per green-CI-policy so the kanban CI shard stays green for downstream PRs.
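The two test fixes described above could be sketched as follows. Only the environment variable name, the 30s default, and the two guard patterns come from the commit message; the function names, the way the setting is read, and the regular expressions are assumptions for illustration, not the code in the repository.

```python
import os
import re

# Name taken from the commit message above.
GRACE_ENV = "HERMES_KANBAN_CRASH_GRACE_SECONDS"


def crash_grace_seconds(default: float = 30.0) -> float:
    """Hypothetical reader for the grace period; 30s is the default the text cites."""
    return float(os.environ.get(GRACE_ENV, default))


# Either guard keeps the waitpid loop off Windows, so a source scan can accept both.
EARLY_RETURN = re.compile(r'if\s+os\.name\s*==\s*"nt"\s*:\s*return\s*\[\]')
INVERTED_GATE = re.compile(r'if\s+os\.name\s*!=\s*"nt"\s*:')


def has_windows_guard(source: str) -> bool:
    """Return True if the source contains either form of the Windows gate."""
    return bool(EARLY_RETURN.search(source) or INVERTED_GATE.search(source))
```

In a pytest test, the grace period would typically be zeroed with `monkeypatch.setenv("HERMES_KANBAN_CRASH_GRACE_SECONDS", "0")` so that the override is undone when the test finishes.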

…rocesses

`sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the underlying FD. `with kb.connect() as conn:` in long-lived processes (gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore leaks one FD to `kanban.db` per call. After enough operations the gateway dies with `[Errno 24] Too many open files` (~4 days uptime in the production report, #33159).

Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db` that wraps `connect()` with a real `try/finally: conn.close()`. Switch the 42 leak-prone call sites in `hermes_cli/kanban.py` (35), `hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py` (3) over to it.

`kanban.py` matters because `run_slash` (called from the gateway for every `/kanban` slash command) parses argparse and dispatches to those `_cmd_*` functions in-process; each one was leaking one FD per invocation. Tests inside `tests/` are untouched: short-lived processes where OS cleanup masks the leak.

Regression tests added in `test_kanban_db.py` cover both happy-path and exception-path closure, plus an explicit assertion that bare `with kb.connect()` still does NOT close (documenting the upstream sqlite3 behaviour we're working around).

Closes #33159.

teknium1 mentioned this pull request May 28, 2026

test(kanban): align two tests with recent kanban hardening #33572

Merged

yu-xin-c mentioned this pull request May 28, 2026

test(kanban): align crash and waitpid guard tests #33573

Closed

teknium1 force-pushed the hermes/hermes-6bc90445 branch from b7ae6e2 to 0b9cf94 on May 28, 2026 01:27

alt-glitch added the type/bug (Something isn't working), comp/cli (CLI entry point, hermes_cli/, setup wizard), and P3 (Low: cosmetic, nice to have) labels on May 28, 2026

alt-glitch mentioned this pull request May 28, 2026

kanban_db.py: connection leak causes 'Too many open files' on macOS (FD exhaustion) #33580

Closed

teknium1 merged commit ebe04c6 into main May 28, 2026
25 checks passed

teknium1 deleted the hermes/hermes-6bc90445 branch May 28, 2026 05:07
