fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100#31726
Closed
someaka wants to merge 55 commits into
Closed
fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100#31726someaka wants to merge 55 commits into
someaka wants to merge 55 commits into
Conversation
… prompts
Replace cronjob-based /loop with same-session timer-driven LoopManager (modeled on GoalManager).
The current agent in the current session periodically wakes up, executes the prompt, and reports back
inline — no separate processes, no popups, full context continuity.
Changes:
- hermes_cli/loop.py: LoopState dataclass + LoopManager class with SessionDB persistence
- hermes_cli/commands.py: /loop CommandDef → Session category, no alias, no subcommands
- cli.py: _get_loop_manager, _handle_loop_command rewritten, _maybe_continue_loop_after_turn hook
- gateway/run.py: _get_loop_manager_for_event, _handle_loop_command rewritten,
_post_turn_loop_continuation, _is_loop_continuation_event, _clear_loop_pending_continuations,
_loop_still_active_for_session, _send_loop_status_notice, _defer_loop_status_notice_after_delivery
- Continuation prompt format: '[Loop check] {prompt}' (distinguishable from goal continuation)
- Goal takes priority over loop when both active
- Interrupt guard: auto-pauses on Ctrl+C
- Empty response guard: skips on empty responses
Tests: 47/47 core pass (25 LoopManager + 22 gateway). 13 CLI tests have pre-existing
pytest-xdist/prompt_toolkit stdout capture issue — pass individually.
Kanban: 8-phase pipeline complete (investigate-A/B/C, plan, verify-plan, implement, verify-impl, final-verify)
Changes: - hermes_cli/loop.py: Add MIN_INTERVAL_SECONDS=60, clamp interval, remove [Loop check] prefix and on_message popup - cli.py: Remove [Loop check] from kickoff enqueue - gateway/run.py: Replace no-op _post_turn_loop_continuation with real timer logic, add daemon ticker thread - tests: Updated test expectations for new behaviour Known gap: _dispatch_loop_prompt needed for TUI path — see docs/plans/loop-fix-plan.md for Tasks 1-3
… TUI gateway - cli.py: skip LoopScheduler spawn + immediate kickoff when HERMES_SLASH_WORKER or HERMES_SESSION_KEY env is set (slash_worker is ephemeral — any thread it spawns dies with the process) - tui_gateway/server.py: add _ensure_loop_ticker() — per-session daemon thread that re-reads loop state from SessionDB each second, fires immediately on first tick, and handles session-key rotation from compression; wired after agent build and after each turn
…ch callback LoopScheduler now takes a single 'dispatch(prompt) -> bool' callback instead of queue/is_idle/on_message plumbing. Both CLI and TUI paths use the SAME class — the only difference is what dispatch does: CLI: dispatch puts prompt into _pending_input queue TUI: dispatch calls _run_prompt_submit() to inject Slash_worker: LoopManager(dispatch=None) — persist only, no scheduler Changes: - hermes_cli/loop.py: refactor LoopScheduler (+del _state, +dispatch), add dispatch param to LoopManager, auto-start on set/resume, add delete() and _del_loop_meta() + delete_loop() convenience helper - cli.py: collapse 3x _is_slash_worker checks into 1 (in _get_loop_manager), add _make_loop_dispatch(), simplify /loop handlers - tui_gateway/server.py: delete _ensure_loop_ticker() (~65 lines), replace with _make_tui_dispatch() + LoopManager in session dict, remove post-turn re-ensure (LoopManager handles own lifecycle) Net: -45 lines, zero duplicated tick logic, clean delete API.
Two import-time crashes fixed: - cli.py: add Callable to typing imports (needed by _make_loop_dispatch returning Callable[[str], bool]) - tui_gateway/server.py: add Callable to typing imports (needed by _make_tui_dispatch signature), fix key variable shadowed in _build() closure — assigning 'key' inside _build() made Python treat all earlier key reads as local, causing UnboundLocalError
The TUI gateway creates LoopManager before the slash_worker persists the loop to SessionDB. Previously the scheduler only started when a loop already existed, so the gateway's ticker slept forever. Two changes: 1. LoopManager.__init__: always start scheduler when dispatch is set, not only when a persisted active loop exists 2. LoopScheduler._tick: when load_loop() returns None, just keep polling instead of setting _running=False — the slash_worker may persist the loop on the very next tick
…ew loop Each /loop [interval] <prompt> creates a new loop with a 6-char UUID. No names, no overwriting — pure UIDs like #a3f1c2. List, pause, resume, and clear target loops by UID. Storage: loop:<sid>:<uid> keys + loop:<sid>:__ids__ registry. All name-based logic removed — parser, manager, and both handlers (CLI + TUI gateway) now use uid-based interface.
Each active loop now shows time remaining until next dispatch. Examples: next: now, next: 42s, next: 3m 15s, next: 1h 23m. Computed from last_fired_at + interval_seconds vs time.time().
Bug 1 — Double fire on /loop create:
The TUI handler returned {type: send, message: prompt} which
immediately submitted the prompt as a user message. The scheduler
then also fired on next tick (last_fired_at=0). Changed to
{type: exec, output: notice} so only the notice shows; scheduler
handles the actual first dispatch.
Bug 2 — Countdown always shows 'now':
LoopManager._states was never refreshed after scheduler wrote
last_fired_at to SessionDB. status_line() now calls load_all_loops()
to pull fresh state before rendering countdown.
- plugins/memory/__init__.py: restore get_active_memory_providers() supporting memory.providers list (multi) and memory.provider (legacy) - agent/memory_manager.py: restore multi-provider add_provider(), remove_provider(), thread safety, schema-before-mutation safety - agent/agent_init.py: loop over all active providers instead of loading a single one - tests: update monkeypatches for renamed function + list returns
…n, process_registry — fork sync May 2026
Extends the previous commit to cover the remaining additive-column index that sits on the same migration trap: - ``task_events.run_id`` -> ``idx_events_run`` was still in SCHEMA_SQL. A legacy ``task_events`` table predating NousResearch#17805 (no ``run_id``) would still abort ``executescript`` before ``_migrate_add_optional_columns`` could add the column. Hoisted out of SCHEMA_SQL and made unconditional in the migration alongside the other three indexes. - Removed the now-redundant ``CREATE INDEX idx_tasks_idempotency`` that was nested inside the ``if "idempotency_key" not in cols`` branch. The unconditional create lower in the function makes it idempotent on both fresh and legacy DBs. - Strengthened the regression test to cover all four indexes (``idx_tasks_session_id``, ``idx_tasks_tenant``, ``idx_tasks_idempotency``, ``idx_events_run``) and to seed a pre-NousResearch#17805 ``task_events`` shape that exercises the ``run_id`` migration path. The result: every ``CREATE INDEX`` that depends on an additive column now runs after the migration ensures the column exists. Verified against a realistic pre-NousResearch#16081 board fixture (tasks + task_events both legacy shape) — origin/main reproduces ``no such column: session_id``; this branch migrates cleanly and creates all four indexes.
…am merge The merge of upstream/main (9faa3cc89) brought in the call site (agent_init.py) and tests but missed the ContextCompressor.__init__ signature change, causing ~528 test failures with: TypeError: ContextCompressor.__init__() got an unexpected keyword argument 'abort_on_summary_failure' Adds the missing parameter and the abort-on-failure logic block in compress() to match upstream at 9aae59f.
Four categories of fixes for fork additions clobbered by upstream merge: 1. gateway/run.py — 3-way surgical insert of loop dispatch methods (_dispatch_loop_prompt, _is_loop_continuation_event, _clear_loop_pending_continuations, _loop_still_active_for_session, _get_loop_manager_for_event, _send_loop_status_notice, _defer_loop_status_notice_after_delivery, _post_turn_loop_continuation) into upstream base preserving all upstream additions (noise filter, reconnect, bundles, platform commands, probe_audio, telegram topic mode) 2. tools/file_operations.py — _check_git_baseline() method + warning wiring in ShellFileOperations.write_file 3. hermes_cli/loop.py — save_loop/load_loop public aliases 4. agent/context_compressor.py — abort_on_summary_failure param + _last_compress_aborted logic (cherry-pick c22770b2e) All fixes verified with Python syntax check + attribute presence.
…add missing blank lines between loop methods
The /loop command was registered under 'Tools & Skills' category but the test test_loop_command_category expects it in 'Session' (alongside /background, /queue, /steer, etc.). Moving to Session is semantically correct since /loop is a per-session recurring prompt runner, similar to /background and /queue.
…ler guard; run_agent.py config delegation to agent_init - _load_loop: assign state.id=uid when deserialized id is None (old JSON compat) - LoopManager.set(): skip fork-pattern scheduler when dispatch scheduler active - run_agent.py: remove 447 lines of compression/skills/tool-use config now in agent_init.py - All 31 loop_manager tests + 23 run_agent regression tests pass
- Remove pre-drain mark_resume_pending block (NousResearch#27856 follow-up): The upstream pre-drain block pre-marked ALL running sessions before drain, then tried to clear on clean drain. This caused mark_resume_pending to fire on clean drains (violating test contract) and to mark sessions that finished during the drain window (stray resume notes). SIGKILL safety is preserved: if killed mid-drain, .clean_shutdown is not written, so suspend_recently_active() on next boot marks sessions as resume_pending with reason='restart_interrupted' which is in _AUTO_RESUME_REASONS — functionally identical auto-resume. - Verified post-timeout block only marks sessions still in _running_agents (not drain-start snapshot), skips _AGENT_PENDING_SENTINEL, uses correct reason strings — all matching test expectations. - Verified skipif fix: bool(os.getenv('CI')) correctly returns Python bool instead of string, avoiding pytest eval('true') NameError. - All 72 tests pass (67 restart-resume + 5 memory-perf).
The test_alias_resolves_to_canonical test expects codex_runtime to resolve to codex-runtime (for Telegram platform underscore-only naming), but the CommandDef had no underscore alias. This matches the existing pattern set by reload-mcp → reload_mcp. All 1679 hermes_cli tests pass (after fixed 1 previously-existing failure). Pre-existing failures in test_gateway_wsl/test_doctor are env/platform issues not introduced by worker changes.
…edrock extra, kanban skills/notify, gateway stdin capture - hermes_cli/doctor.py: move codex CLI hint into the Codex auth block with proper gating (index under provider section, suppress when codex already installed) - hermes_cli/commands.py [already committed]: quit CommandDef gets args_hint='[--delete]' - gateway/config.py: add Google Chat to _PLATFORM_CONNECTED_CHECKERS - pyproject.toml: exclude bedrock extras from [all] (boto3 is heavy) - hermes_cli/kanban.py: remove CLI auto-subscribe in _cmd_create to avoid double-sub with gateway handler - tests/hermes_cli/test_kanban_core_functionality.py: kanban_home fixture creates kanban-worker skill dir - tests/hermes_cli/test_gateway_service.py: mock prompt_yes_no + fix systemd_install lambda + stub systemd_start - tests/hermes_cli/test_gateway_wsl.py: mock prompt_yes_no + stub systemd_start Co-authored-by: inspector t_573b52d9
…andler - _cmd_create: restore CLI auto-subscribe (removed in c0f1a2ce4), gated on new _gateway_kanban_in_progress module-level flag - _handle_kanban_command: set flag before run_slash, clear in finally - test: gateway test sets _HERMES_GATEWAY=1 to match production env This fixes test_cli_create_auto_subscribes without re-breaking test_gateway_create_autosubscribes_on_explicit_board.
…failure The original broad except Exception silently swallowed all setup errors including partial state where sys.stdout was wrapped but state['installed'] was never set to True. This caused isinstance(sys.stdout, _UpdateOutputStream) to fail in CI when pytest-xdist worker forking triggered an import-path mismatch mid-setup. Fix: restore prev_stdout/prev_stderr in the except handler if the streams were partially replaced but the install didn't complete, so the test's isinstance check always sees consistent state regardless of where the failure occurred within the try block.
…oss tests
The _cleanup_envs finalizer in the vercel sandbox test fixture was
calling env.cleanup() which writes snapshot metadata to the snapshot
store. This leaked {task-123: snap_default} across test boundaries on
Python 3.12 CI + xdist, causing test_persistent_cleanup_without_task_id
to fail with AssertionError: {'task-123': 'snap_default'} == {}.
Fix: set env._persistent = False in the finalizer so cleanup doesn't
write test-observable snapshot state. The tests themselves already call
cleanup() and verify snapshot state explicitly.
Fixes t_b8fcdc5c
…cess_notification The pre-input drain in process_loop() called _format_process_notification(evt) which does not exist anywhere in the codebase. The correct name is format_process_notification from tools.process_registry. The bare except around this block silently swallowed the NameError, causing all completion_queue notifications to be silently dropped in the pre-input check path. Idle and post-turn drains (which use drain_notifications() correctly) were unaffected.
Adds zero-polling kanban→TUI notification delivery using a Unix FIFO: - kanban_db.py: _append_event now writes a JSON line to ~/.hermes/tui_kanban.fifo when notification-worthy events are written (completed, blocked, gave_up, crashed, timed_out). Best-effort — silently skipped if FIFO doesn't exist. - tui_gateway/server.py: Creates the FIFO at module level and starts a per-session daemon reader thread. The thread's open() blocks at the OS level — zero polling, zero CPU when idle. On data arrival, queries kanban DB for CLI-subscribed tasks, formats via _format_kanban_notification, emits status.update for TUI visibility, and injects as agent turn via _run_prompt_submit. - FIFO cleaned up on gateway exit via atexit handler so stale FIFOs never block future startup.
27 tests covering the zero-polling kanban→TUI FIFO notification bridge: Writer side (kanban_db._append_event): - All 5 notify kinds (completed/blocked/crashed/timed_out/gave_up) write JSON to FIFO - Non-notify kinds (created/heartbeat) do NOT write - Missing FIFO does not raise - Multiple events produce multiple lines Reader side (tui_gateway.server._format_kanban_notification): - All 5 event kinds format correctly with/without payload - Unknown kind returns None - Summary/reason truncation (200/160 chars) - Dict-based event access (DB row format) Lifecycle: - FIFO created on module import - atexit cleanup unlinks FIFO End-to-end: - Complete task → FIFO → reader receives correct JSON - Non-notify events invisible to reader
…TUI dedup Bug B: poll() was marking completions as consumed, which prevented the TUI notification poller from delivering notify_on_complete notifications. - Remove self._completion_consumed.add(session_id) from poll() - Add mark_completion_consumed() method to ProcessRegistry - Call mark_completion_consumed() from TUI poller after successful delivery This restores notification delivery while preventing duplicates via explicit TUI-side consumption tracking.
…comments Bug A (cli.py typo) and Bug B (poll() marking consumed) were already fixed in prior commits. This cleans up the remaining stale docstring and comments that still referenced 'wait/poll/log' — poll() no longer marks completions as consumed.
…r dispatch Adds test_notification_poller_marks_consumed_after_dispatch to ensure the TUI notification poller marks completions as consumed after successfully dispatching them, preventing duplicate notifications.
…otifications Add _session_to_run reverse mapping (session_key -> run_id) to APIServerAdapter. Add push_process_event() method that looks up the run's SSE queue by session_key. Fix gateway watcher call signature to pass event dict with session_key embedded. Root cause: gateway's _run_process_watcher tried adapter.handle_message() for api_server platform, but api_server is an HTTP/SSE handler, not a gateway adapter. Notification was silently dropped with 'no routing metadata'. Now the gateway pushes process.completed events directly to the api_server's SSE queue when platform is api_server.
…otifications Add _session_to_run reverse mapping and push_process_event() to APIServerAdapter. Delete duplicate old push_process_event stub. Fix gateway watcher to pass event dict with session_key embedded. Root cause: gateway _run_process_watcher tried adapter.handle_message() for api_server but api_server is HTTP/SSE, not a gateway adapter. Notification silently dropped with 'no routing metadata' warning.
… not just HEAD~10
…nt=100 PRAGMA synchronous=NORMAL defers fsync in WAL mode, leaving kanban.db vulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access. Two changes: - synchronous=NORMAL → FULL: ensures every WAL frame is fsync'd before the write completes, preventing WAL/main-DB inconsistency - wal_autocheckpoint=100: limits WAL growth to 100 pages between automatic checkpoints, bounding checkpoint I/O and reducing the window where a SIGKILL can leave a large WAL in a fragile state Together these match the fix proposed in upstream PRs NousResearch#30969 / NousResearch#30973. Fixes: NousResearch#31502
Collaborator
|
Superseded by #31731 (clean replacement — same author, focused 1-file diff vs stacked branch). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
PRAGMA synchronous=NORMAL(line 1184 ofhermes_cli/kanban_db.py) defers fsync in WAL mode, leavingkanban.dbvulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access. This causesdatabase disk image is malformederrors after ~9-10 rapid task creations.Root Cause
synchronous=NORMALmeans SQLite only syncs at checkpoint boundaries, not after every WAL frame write. If a writer process is killed mid-transaction (SIGKILL from reclaim, gateway shutdown, OOM killer), WAL frames may be written but the main DB is left in an inconsistent state. Next connection: malformed DB.Fix — two changes
synchronous=NORMAL→FULL: ensures every WAL frame is fsync'd before the write completes, preventing WAL/main-DB inconsistency even after SIGKILLwal_autocheckpoint=100: caps WAL at 100 pages between automatic checkpoints — bounds the checkpoint I/O spike and reduces the window where a large WAL is fragileThese match the fix proposed in upstream PRs #30969 / #30973 which are not yet merged.
Trade-off
synchronous=FULLadds one fsync per write — ~1ms overhead on SSD, ~5-10ms on HDD. For kanban (create/show/complete operations, not a high-throughput OLTP path), this is negligible vs. the cost of DB corruption and manual recovery.2 lines, 1 file.
Fixes: #31502
Related: #31618, #30896