fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 by someaka · Pull Request #31726 · NousResearch/hermes-agent

someaka · 2026-05-24T23:42:02Z

Problem

PRAGMA synchronous=NORMAL (line 1184 of hermes_cli/kanban_db.py) defers fsync in WAL mode, leaving kanban.db vulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access. This causes database disk image is malformed errors after ~9-10 rapid task creations.

Root Cause

synchronous=NORMAL means SQLite only syncs at checkpoint boundaries, not after every WAL frame write. If a writer process is killed mid-transaction (SIGKILL from reclaim, gateway shutdown, OOM killer), WAL frames may be written but the main DB is left in an inconsistent state. Next connection: malformed DB.

Fix — two changes

synchronous=NORMAL → FULL: ensures every WAL frame is fsync'd before the write completes, preventing WAL/main-DB inconsistency even after SIGKILL
wal_autocheckpoint=100: caps WAL at 100 pages between automatic checkpoints — bounds the checkpoint I/O spike and reduces the window where a large WAL is fragile

These match the fix proposed in upstream PRs #30969 / #30973 which are not yet merged.

Trade-off

synchronous=FULL adds one fsync per write — ~1ms overhead on SSD, ~5-10ms on HDD. For kanban (create/show/complete operations, not a high-throughput OLTP path), this is negligible vs. the cost of DB corruption and manual recovery.

2 lines, 1 file.

Fixes: #31502
Related: #31618, #30896

… prompts Replace cronjob-based /loop with same-session timer-driven LoopManager (modeled on GoalManager). The current agent in the current session periodically wakes up, executes the prompt, and reports back inline — no separate processes, no popups, full context continuity. Changes: - hermes_cli/loop.py: LoopState dataclass + LoopManager class with SessionDB persistence - hermes_cli/commands.py: /loop CommandDef → Session category, no alias, no subcommands - cli.py: _get_loop_manager, _handle_loop_command rewritten, _maybe_continue_loop_after_turn hook - gateway/run.py: _get_loop_manager_for_event, _handle_loop_command rewritten, _post_turn_loop_continuation, _is_loop_continuation_event, _clear_loop_pending_continuations, _loop_still_active_for_session, _send_loop_status_notice, _defer_loop_status_notice_after_delivery - Continuation prompt format: '[Loop check] {prompt}' (distinguishable from goal continuation) - Goal takes priority over loop when both active - Interrupt guard: auto-pauses on Ctrl+C - Empty response guard: skips on empty responses Tests: 47/47 core pass (25 LoopManager + 22 gateway). 13 CLI tests have pre-existing pytest-xdist/prompt_toolkit stdout capture issue — pass individually. Kanban: 8-phase pipeline complete (investigate-A/B/C, plan, verify-plan, implement, verify-impl, final-verify)

… prompts

Changes: - hermes_cli/loop.py: Add MIN_INTERVAL_SECONDS=60, clamp interval, remove [Loop check] prefix and on_message popup - cli.py: Remove [Loop check] from kickoff enqueue - gateway/run.py: Replace no-op _post_turn_loop_continuation with real timer logic, add daemon ticker thread - tests: Updated test expectations for new behaviour Known gap: _dispatch_loop_prompt needed for TUI path — see docs/plans/loop-fix-plan.md for Tasks 1-3

…tion

… TUI gateway - cli.py: skip LoopScheduler spawn + immediate kickoff when HERMES_SLASH_WORKER or HERMES_SESSION_KEY env is set (slash_worker is ephemeral — any thread it spawns dies with the process) - tui_gateway/server.py: add _ensure_loop_ticker() — per-session daemon thread that re-reads loop state from SessionDB each second, fires immediately on first tick, and handles session-key rotation from compression; wired after agent build and after each turn

…ch callback LoopScheduler now takes a single 'dispatch(prompt) -> bool' callback instead of queue/is_idle/on_message plumbing. Both CLI and TUI paths use the SAME class — the only difference is what dispatch does: CLI: dispatch puts prompt into _pending_input queue TUI: dispatch calls _run_prompt_submit() to inject Slash_worker: LoopManager(dispatch=None) — persist only, no scheduler Changes: - hermes_cli/loop.py: refactor LoopScheduler (+del _state, +dispatch), add dispatch param to LoopManager, auto-start on set/resume, add delete() and _del_loop_meta() + delete_loop() convenience helper - cli.py: collapse 3x _is_slash_worker checks into 1 (in _get_loop_manager), add _make_loop_dispatch(), simplify /loop handlers - tui_gateway/server.py: delete _ensure_loop_ticker() (~65 lines), replace with _make_tui_dispatch() + LoopManager in session dict, remove post-turn re-ensure (LoopManager handles own lifecycle) Net: -45 lines, zero duplicated tick logic, clean delete API.

Two import-time crashes fixed: - cli.py: add Callable to typing imports (needed by _make_loop_dispatch returning Callable[[str], bool]) - tui_gateway/server.py: add Callable to typing imports (needed by _make_tui_dispatch signature), fix key variable shadowed in _build() closure — assigning 'key' inside _build() made Python treat all earlier key reads as local, causing UnboundLocalError

The TUI gateway creates LoopManager before the slash_worker persists the loop to SessionDB. Previously the scheduler only started when a loop already existed, so the gateway's ticker slept forever. Two changes: 1. LoopManager.__init__: always start scheduler when dispatch is set, not only when a persisted active loop exists 2. LoopScheduler._tick: when load_loop() returns None, just keep polling instead of setting _running=False — the slash_worker may persist the loop on the very next tick

…ew loop Each /loop [interval] <prompt> creates a new loop with a 6-char UUID. No names, no overwriting — pure UIDs like #a3f1c2. List, pause, resume, and clear target loops by UID. Storage: loop:<sid>:<uid> keys + loop:<sid>:__ids__ registry. All name-based logic removed — parser, manager, and both handlers (CLI + TUI gateway) now use uid-based interface.

Each active loop now shows time remaining until next dispatch. Examples: next: now, next: 42s, next: 3m 15s, next: 1h 23m. Computed from last_fired_at + interval_seconds vs time.time().

Bug 1 — Double fire on /loop create: The TUI handler returned {type: send, message: prompt} which immediately submitted the prompt as a user message. The scheduler then also fired on next tick (last_fired_at=0). Changed to {type: exec, output: notice} so only the notice shows; scheduler handles the actual first dispatch. Bug 2 — Countdown always shows 'now': LoopManager._states was never refreshed after scheduler wrote last_fired_at to SessionDB. status_line() now calls load_all_loops() to pull fresh state before rendering countdown.

- plugins/memory/__init__.py: restore get_active_memory_providers() supporting memory.providers list (multi) and memory.provider (legacy) - agent/memory_manager.py: restore multi-provider add_provider(), remove_provider(), thread safety, schema-before-mutation safety - agent/agent_init.py: loop over all active providers instead of loading a single one - tests: update monkeypatches for renamed function + list returns

…n, process_registry — fork sync May 2026

Extends the previous commit to cover the remaining additive-column index that sits on the same migration trap: - ``task_events.run_id`` -> ``idx_events_run`` was still in SCHEMA_SQL. A legacy ``task_events`` table predating NousResearch#17805 (no ``run_id``) would still abort ``executescript`` before ``_migrate_add_optional_columns`` could add the column. Hoisted out of SCHEMA_SQL and made unconditional in the migration alongside the other three indexes. - Removed the now-redundant ``CREATE INDEX idx_tasks_idempotency`` that was nested inside the ``if "idempotency_key" not in cols`` branch. The unconditional create lower in the function makes it idempotent on both fresh and legacy DBs. - Strengthened the regression test to cover all four indexes (``idx_tasks_session_id``, ``idx_tasks_tenant``, ``idx_tasks_idempotency``, ``idx_events_run``) and to seed a pre-NousResearch#17805 ``task_events`` shape that exercises the ``run_id`` migration path. The result: every ``CREATE INDEX`` that depends on an additive column now runs after the migration ensures the column exists. Verified against a realistic pre-NousResearch#16081 board fixture (tasks + task_events both legacy shape) — origin/main reproduces ``no such column: session_id``; this branch migrates cleanly and creates all four indexes.

…am merge The merge of upstream/main (9faa3cc89) brought in the call site (agent_init.py) and tests but missed the ContextCompressor.__init__ signature change, causing ~528 test failures with: TypeError: ContextCompressor.__init__() got an unexpected keyword argument 'abort_on_summary_failure' Adds the missing parameter and the abort-on-failure logic block in compress() to match upstream at 9aae59f.

Four categories of fixes for fork additions clobbered by upstream merge: 1. gateway/run.py — 3-way surgical insert of loop dispatch methods (_dispatch_loop_prompt, _is_loop_continuation_event, _clear_loop_pending_continuations, _loop_still_active_for_session, _get_loop_manager_for_event, _send_loop_status_notice, _defer_loop_status_notice_after_delivery, _post_turn_loop_continuation) into upstream base preserving all upstream additions (noise filter, reconnect, bundles, platform commands, probe_audio, telegram topic mode) 2. tools/file_operations.py — _check_git_baseline() method + warning wiring in ShellFileOperations.write_file 3. hermes_cli/loop.py — save_loop/load_loop public aliases 4. agent/context_compressor.py — abort_on_summary_failure param + _last_compress_aborted logic (cherry-pick c22770b2e) All fixes verified with Python syntax check + attribute presence.

…add missing blank lines between loop methods

The /loop command was registered under 'Tools & Skills' category but the test test_loop_command_category expects it in 'Session' (alongside /background, /queue, /steer, etc.). Moving to Session is semantically correct since /loop is a per-session recurring prompt runner, similar to /background and /queue.

…ler guard; run_agent.py config delegation to agent_init - _load_loop: assign state.id=uid when deserialized id is None (old JSON compat) - LoopManager.set(): skip fork-pattern scheduler when dispatch scheduler active - run_agent.py: remove 447 lines of compression/skills/tool-use config now in agent_init.py - All 31 loop_manager tests + 23 run_agent regression tests pass

- Remove pre-drain mark_resume_pending block (NousResearch#27856 follow-up): The upstream pre-drain block pre-marked ALL running sessions before drain, then tried to clear on clean drain. This caused mark_resume_pending to fire on clean drains (violating test contract) and to mark sessions that finished during the drain window (stray resume notes). SIGKILL safety is preserved: if killed mid-drain, .clean_shutdown is not written, so suspend_recently_active() on next boot marks sessions as resume_pending with reason='restart_interrupted' which is in _AUTO_RESUME_REASONS — functionally identical auto-resume. - Verified post-timeout block only marks sessions still in _running_agents (not drain-start snapshot), skips _AGENT_PENDING_SENTINEL, uses correct reason strings — all matching test expectations. - Verified skipif fix: bool(os.getenv('CI')) correctly returns Python bool instead of string, avoiding pytest eval('true') NameError. - All 72 tests pass (67 restart-resume + 5 memory-perf).

The test_alias_resolves_to_canonical test expects codex_runtime to resolve to codex-runtime (for Telegram platform underscore-only naming), but the CommandDef had no underscore alias. This matches the existing pattern set by reload-mcp → reload_mcp. All 1679 hermes_cli tests pass (after fixed 1 previously-existing failure). Pre-existing failures in test_gateway_wsl/test_doctor are env/platform issues not introduced by worker changes.

… mode

…edrock extra, kanban skills/notify, gateway stdin capture - hermes_cli/doctor.py: move codex CLI hint into the Codex auth block with proper gating (index under provider section, suppress when codex already installed) - hermes_cli/commands.py [already committed]: quit CommandDef gets args_hint='[--delete]' - gateway/config.py: add Google Chat to _PLATFORM_CONNECTED_CHECKERS - pyproject.toml: exclude bedrock extras from [all] (boto3 is heavy) - hermes_cli/kanban.py: remove CLI auto-subscribe in _cmd_create to avoid double-sub with gateway handler - tests/hermes_cli/test_kanban_core_functionality.py: kanban_home fixture creates kanban-worker skill dir - tests/hermes_cli/test_gateway_service.py: mock prompt_yes_no + fix systemd_install lambda + stub systemd_start - tests/hermes_cli/test_gateway_wsl.py: mock prompt_yes_no + stub systemd_start Co-authored-by: inspector t_573b52d9

…ag clearing

…andler - _cmd_create: restore CLI auto-subscribe (removed in c0f1a2ce4), gated on new _gateway_kanban_in_progress module-level flag - _handle_kanban_command: set flag before run_slash, clear in finally - test: gateway test sets _HERMES_GATEWAY=1 to match production env This fixes test_cli_create_auto_subscribes without re-breaking test_gateway_create_autosubscribes_on_explicit_board.

…failure The original broad except Exception silently swallowed all setup errors including partial state where sys.stdout was wrapped but state['installed'] was never set to True. This caused isinstance(sys.stdout, _UpdateOutputStream) to fail in CI when pytest-xdist worker forking triggered an import-path mismatch mid-setup. Fix: restore prev_stdout/prev_stderr in the except handler if the streams were partially replaced but the install didn't complete, so the test's isinstance check always sees consistent state regardless of where the failure occurred within the try block.

…oss tests The _cleanup_envs finalizer in the vercel sandbox test fixture was calling env.cleanup() which writes snapshot metadata to the snapshot store. This leaked {task-123: snap_default} across test boundaries on Python 3.12 CI + xdist, causing test_persistent_cleanup_without_task_id to fail with AssertionError: {'task-123': 'snap_default'} == {}. Fix: set env._persistent = False in the finalizer so cleanup doesn't write test-observable snapshot state. The tests themselves already call cleanup() and verify snapshot state explicitly. Fixes t_b8fcdc5c

… dependency

…cess_notification The pre-input drain in process_loop() called _format_process_notification(evt) which does not exist anywhere in the codebase. The correct name is format_process_notification from tools.process_registry. The bare except around this block silently swallowed the NameError, causing all completion_queue notifications to be silently dropped in the pre-input check path. Idle and post-turn drains (which use drain_notifications() correctly) were unaffected.

Adds zero-polling kanban→TUI notification delivery using a Unix FIFO: - kanban_db.py: _append_event now writes a JSON line to ~/.hermes/tui_kanban.fifo when notification-worthy events are written (completed, blocked, gave_up, crashed, timed_out). Best-effort — silently skipped if FIFO doesn't exist. - tui_gateway/server.py: Creates the FIFO at module level and starts a per-session daemon reader thread. The thread's open() blocks at the OS level — zero polling, zero CPU when idle. On data arrival, queries kanban DB for CLI-subscribed tasks, formats via _format_kanban_notification, emits status.update for TUI visibility, and injects as agent turn via _run_prompt_submit. - FIFO cleaned up on gateway exit via atexit handler so stale FIFOs never block future startup.

27 tests covering the zero-polling kanban→TUI FIFO notification bridge: Writer side (kanban_db._append_event): - All 5 notify kinds (completed/blocked/crashed/timed_out/gave_up) write JSON to FIFO - Non-notify kinds (created/heartbeat) do NOT write - Missing FIFO does not raise - Multiple events produce multiple lines Reader side (tui_gateway.server._format_kanban_notification): - All 5 event kinds format correctly with/without payload - Unknown kind returns None - Summary/reason truncation (200/160 chars) - Dict-based event access (DB row format) Lifecycle: - FIFO created on module import - atexit cleanup unlinks FIFO End-to-end: - Complete task → FIFO → reader receives correct JSON - Non-notify events invisible to reader

…TUI dedup Bug B: poll() was marking completions as consumed, which prevented the TUI notification poller from delivering notify_on_complete notifications. - Remove self._completion_consumed.add(session_id) from poll() - Add mark_completion_consumed() method to ProcessRegistry - Call mark_completion_consumed() from TUI poller after successful delivery This restores notification delivery while preventing duplicates via explicit TUI-side consumption tracking.

…comments Bug A (cli.py typo) and Bug B (poll() marking consumed) were already fixed in prior commits. This cleans up the remaining stale docstring and comments that still referenced 'wait/poll/log' — poll() no longer marks completions as consumed.

…r dispatch Adds test_notification_poller_marks_consumed_after_dispatch to ensure the TUI notification poller marks completions as consumed after successfully dispatching them, preventing duplicate notifications.

…on paths

…pi_server SSE

…otifications Add _session_to_run reverse mapping (session_key -> run_id) to APIServerAdapter. Add push_process_event() method that looks up the run's SSE queue by session_key. Fix gateway watcher call signature to pass event dict with session_key embedded. Root cause: gateway's _run_process_watcher tried adapter.handle_message() for api_server platform, but api_server is an HTTP/SSE handler, not a gateway adapter. Notification was silently dropped with 'no routing metadata'. Now the gateway pushes process.completed events directly to the api_server's SSE queue when platform is api_server.

…otifications Add _session_to_run reverse mapping and push_process_event() to APIServerAdapter. Delete duplicate old push_process_event stub. Fix gateway watcher to pass event dict with session_key embedded. Root cause: gateway _run_process_watcher tried adapter.handle_message() for api_server but api_server is HTTP/SSE, not a gateway adapter. Notification silently dropped with 'no routing metadata' warning.

… not just HEAD~10

…mpliance

…nt=100 PRAGMA synchronous=NORMAL defers fsync in WAL mode, leaving kanban.db vulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access. Two changes: - synchronous=NORMAL → FULL: ensures every WAL frame is fsync'd before the write completes, preventing WAL/main-DB inconsistency - wal_autocheckpoint=100: limits WAL growth to 100 pages between automatic checkpoints, bounding checkpoint I/O and reducing the window where a SIGKILL can leave a large WAL in a fragile state Together these match the fix proposed in upstream PRs NousResearch#30969 / NousResearch#30973. Fixes: NousResearch#31502

alt-glitch · 2026-05-25T00:00:14Z

Superseded by #31731 (clean replacement — same author, focused 1-file diff vs stacked branch).

Ed and others added 30 commits May 25, 2026 01:35

fix(loop): rewrite /loop as same-session heartbeat using continuation…

b1bb388

… prompts

feat(gateway): add _dispatch_loop_prompt for adapter-free loop injection

c358498

fix(gateway): add daemon loop ticker with _dispatch_loop_prompt injec…

f278d6b

…tion

fix(gateway): use _dispatch_loop_prompt in post-turn loop continuation

961f590

test: update loop tests for _dispatch_loop_prompt and no on_message

4e019e8

feat(loop): live countdown — next tick time in /loop list/status

332fcf7

Each active loop now shows time remaining until next dispatch. Examples: next: now, next: 42s, next: 3m 15s, next: 1h 23m. Computed from last_fired_at + interval_seconds vs time.time().

feat: multi-provider memory, kanban, holographic, mem0, ha integratio…

056b764

…n, process_registry — fork sync May 2026

fix(kanban): migrate task session index after columns

52e3208

fix(gateway): restore @staticmethod on _is_loop_continuation_event + …

550ded5

…add missing blank lines between loop methods

fix(gateway): wire _handle_loop_command dispatch for /loop in gateway…

ebbe55c

… mode

fix(compress): add _last_compress_aborted attr, force kwarg, abort fl…

6819127

…ag clearing

Radical Edward and others added 22 commits May 25, 2026 01:35

fix: remove '-n 0' from integration test command — pytest-xdist not a…

4b83aa0

… dependency

chore: refresh uv.lock after upstream rebase (4 new kanban commits)

94d1fd4

docs: add TUI notification pipeline investigation report

510eece

fix(webui): propagate contextvars for background process notifications

3f3f0ca

fix(webui): propagate contextvars for background process notifications

4878b59

test(notifications): add regression guard coverage for all notificati…

0cc3537

…on paths

fix(webui): propagate contextvars for background process notifications

c6d54fa

test(webui): add notification delivery tests for contextvar propagation

e303b76

fix(tests): address verification findings for regression guards

f271fa3

fix(gateway): deliver background process notifications to WebUI via a…

6a37df0

…pi_server SSE

fix(tests): make git history guard test robust — search all branches,…

e00b103

… not just HEAD~10

fix(kanban): add encoding=utf-8 to FIFO open() for Windows footgun co…

f255d30

…mpliance

someaka requested a review from a team May 24, 2026 23:42

someaka closed this May 24, 2026

someaka deleted the fix/31502-kanban-synchronous-full branch May 24, 2026 23:51

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 24, 2026

alt-glitch mentioned this pull request May 25, 2026

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31731

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100#31726

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100#31726
someaka wants to merge 55 commits into
NousResearch:mainfrom
someaka:fix/31502-kanban-synchronous-full

someaka commented May 24, 2026

Uh oh!

alt-glitch commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

someaka commented May 24, 2026

Problem

Root Cause

Fix — two changes

Trade-off

Uh oh!

alt-glitch commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants