Skip to content

feat(kanban): gate notifier watcher on dispatch_in_gateway#31964

Closed
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-notifier-gate-dispatch-in-gateway
Closed

feat(kanban): gate notifier watcher on dispatch_in_gateway#31964
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:feat/kanban-notifier-gate-dispatch-in-gateway

Conversation

@steveonjava

Copy link
Copy Markdown
Contributor

What changes

When you run multiple gateway processes (e.g. one per messaging platform, or a separate dispatch-only process), only the gateway that owns kanban dispatch needs to poll kanban databases for task-completion notifications. This PR adds that gate.

Non-dispatch gateways no longer open kanban DBs for notifier polling. No behavior change for single-gateway setups — the gate is open by default unless kanban.dispatch_in_gateway: false is set in config (or the HERMES_KANBAN_DISPATCH_IN_GATEWAY=false env var is set).

Why you'd notice this

Before this change, running N gateway processes on the same host causes all N processes to open the kanban DB files concurrently. For each open, SQLite's WAL mode writes to the -shm shared-memory file. Under moderate task throughput those concurrent opens compound into reader-count exhaustion inside the -shm file, which surfaces as database corruption errors or I/O errors in the non-dispatch gateways. The corruption only affects non-dispatch processes (they only polled, never wrote), but the error messages look alarming.

After this change the non-dispatch gateways exit the notifier path early, reducing concurrent DB opens to 1.

When to use

  • Single-gateway (default): no action needed. Gate defaults to open.
  • Multi-gateway: set kanban.dispatch_in_gateway: false in each non-dispatch gateway's config, or set HERMES_KANBAN_DISPATCH_IN_GATEWAY=false. This eliminates the concurrent-open problem entirely.

How to Test

# Single-gateway: notifier watcher starts normally
HERMES_KANBAN_DISPATCH_IN_GATEWAY=true python -m pytest tests/gateway/test_kanban_notifier_watcher_dispatch_gate.py -v

# Multi-gateway: non-dispatch gateway skips notifier
HERMES_KANBAN_DISPATCH_IN_GATEWAY=false python -m pytest tests/gateway/test_kanban_notifier_watcher_dispatch_gate.py -v

All 5 notifier-gate tests pass. See docs/kanban/multi-gateway.md for deployment patterns.

Checklist

  • Tests added for new behavior
  • No breaking change for single-gateway deployments
  • Docs updated (docs/kanban/multi-gateway.md)

@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/gateway Gateway runner, session dispatch, delivery labels May 25, 2026
@steveonjava steveonjava marked this pull request as ready for review May 25, 2026 23:32
When dispatch_in_gateway is false (the default), the Kanban notifier
watcher now exits immediately after the initial status sync. This
prevents multi-gateway deployments from running duplicate watchers
when only one gateway is acting as dispatcher.

No behavior change for single-gateway setups (dispatch_in_gateway
defaults to true there).
@steveonjava steveonjava force-pushed the feat/kanban-notifier-gate-dispatch-in-gateway branch from 6f37aa0 to 573e50e Compare May 28, 2026 16:57
@steveonjava

Copy link
Copy Markdown
Contributor Author

Rebased onto current upstream/main (e8b9369). Resolved conflicts: dropped duplicate AUTHOR_MAP entry (upstream already added steve@steveonjava.com and steveonjava@gmail.com in 90b6b3d region) and removed an obsolete inline Mattermost adapter block that conflicted with upstream af973e4's migration of Mattermost to a bundled plugin. The 5-test suite in tests/gateway/test_kanban_notifier_watcher_dispatch_gate.py still passes. PR now down to 3 files (+188), no longer DIRTY.

@steveonjava

Copy link
Copy Markdown
Contributor Author

Heads up on a CI failure in this PR's checks that doesn't look related to the change here:

tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_resize_escape_is_forwarded is timing out at the 30s pytest-timeout boundary in slice 6/6 (latest failure). The test exercises websocket → PTY TIOCSWINSZ plumbing — fully orthogonal to this PR, which only touches gateway/run.py (kanban notifier gate) plus a new isolated kanban test and docs.

Looks like a known-fragile area. The sibling test TestPtyWebSocket::test_pub_broadcasts_to_events_subscribers was the subject of two recent flake-fix attempts in the same class (#31909 and #31914 by @talwayh1, both closed without merge on 2026-05-25). Same root cause shape: TestClient runs the ASGI app in a background thread, and under CI load on the slice-6 worker (which is the densest slice), the main thread enters receive_bytes() before the ASGI thread processes the frame.

Locally the test passes 60-for-60 in <1s. On upstream/main slice 6 it passes too. My PR adds 5 new tests to tests/gateway/ which shifts the pytest-xdist slice composition just enough to expose the latent race for test_resize_escape_is_forwarded as well.

I don't want to bolt a time.sleep(1.0) yield onto this PR (that approach was already declined twice). Happy to either:

  1. Leave as-is and have a maintainer judge whether this counts as a real blocker
  2. Move the new tests into a different package path to nudge the xdist hash and re-run
  3. Open a separate PR scoped purely to fixing the TestPtyWebSocket flake family — with maintainer guidance on the preferred fix shape

Let me know which route makes sense.

teknium1 pushed a commit that referenced this pull request Jun 2, 2026
Non-dispatch gateways no longer open per-board kanban DBs for notifier
polling. Mirrors the existing dispatcher gate (config
kanban.dispatch_in_gateway, default True; env override
HERMES_KANBAN_DISPATCH_IN_GATEWAY) so multi-gateway setups collapse to a
single process holding kanban.db file descriptors.

Salvaged from PR #31964 by @steveonjava; tests and docs trimmed during
salvage.
@teknium1

teknium1 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Salvaged onto current main via #37174 with your authorship preserved (rebase-merge). Trimmed during salvage — removed a duplicate test and the redundant ASCII diagram, 188 to 137 LOC. Thanks @steveonjava!

@teknium1 teknium1 closed this Jun 2, 2026
teknium1 pushed a commit that referenced this pull request Jun 2, 2026
Non-dispatch gateways no longer open per-board kanban DBs for notifier
polling. Mirrors the existing dispatcher gate (config
kanban.dispatch_in_gateway, default True; env override
HERMES_KANBAN_DISPATCH_IN_GATEWAY) so multi-gateway setups collapse to a
single process holding kanban.db file descriptors.

Salvaged from PR #31964 by @steveonjava; tests and docs trimmed during
salvage.
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
Non-dispatch gateways no longer open per-board kanban DBs for notifier
polling. Mirrors the existing dispatcher gate (config
kanban.dispatch_in_gateway, default True; env override
HERMES_KANBAN_DISPATCH_IN_GATEWAY) so multi-gateway setups collapse to a
single process holding kanban.db file descriptors.

Salvaged from PR NousResearch#31964 by @steveonjava; tests and docs trimmed during
salvage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants