Skip to content

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100#31726

Closed
someaka wants to merge 55 commits into
NousResearch:mainfrom
someaka:fix/31502-kanban-synchronous-full
Closed

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100#31726
someaka wants to merge 55 commits into
NousResearch:mainfrom
someaka:fix/31502-kanban-synchronous-full

Conversation

@someaka

@someaka someaka commented May 24, 2026

Copy link
Copy Markdown

Problem

PRAGMA synchronous=NORMAL (line 1184 of hermes_cli/kanban_db.py) defers fsync in WAL mode, leaving kanban.db vulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access. This causes database disk image is malformed errors after ~9-10 rapid task creations.

Root Cause

synchronous=NORMAL means SQLite only syncs at checkpoint boundaries, not after every WAL frame write. If a writer process is killed mid-transaction (SIGKILL from reclaim, gateway shutdown, OOM killer), WAL frames may be written but the main DB is left in an inconsistent state. Next connection: malformed DB.

Fix — two changes

  1. synchronous=NORMALFULL: ensures every WAL frame is fsync'd before the write completes, preventing WAL/main-DB inconsistency even after SIGKILL
  2. wal_autocheckpoint=100: caps WAL at 100 pages between automatic checkpoints — bounds the checkpoint I/O spike and reduces the window where a large WAL is fragile

These match the fix proposed in upstream PRs #30969 / #30973 which are not yet merged.

Trade-off

synchronous=FULL adds one fsync per write — ~1ms overhead on SSD, ~5-10ms on HDD. For kanban (create/show/complete operations, not a high-throughput OLTP path), this is negligible vs. the cost of DB corruption and manual recovery.

2 lines, 1 file.

Fixes: #31502
Related: #31618, #30896

Ed and others added 30 commits May 25, 2026 01:35
… prompts

Replace cronjob-based /loop with same-session timer-driven LoopManager (modeled on GoalManager).
The current agent in the current session periodically wakes up, executes the prompt, and reports back
inline — no separate processes, no popups, full context continuity.

Changes:
- hermes_cli/loop.py: LoopState dataclass + LoopManager class with SessionDB persistence
- hermes_cli/commands.py: /loop CommandDef → Session category, no alias, no subcommands
- cli.py: _get_loop_manager, _handle_loop_command rewritten, _maybe_continue_loop_after_turn hook
- gateway/run.py: _get_loop_manager_for_event, _handle_loop_command rewritten,
  _post_turn_loop_continuation, _is_loop_continuation_event, _clear_loop_pending_continuations,
  _loop_still_active_for_session, _send_loop_status_notice, _defer_loop_status_notice_after_delivery
- Continuation prompt format: '[Loop check] {prompt}' (distinguishable from goal continuation)
- Goal takes priority over loop when both active
- Interrupt guard: auto-pauses on Ctrl+C
- Empty response guard: skips on empty responses

Tests: 47/47 core pass (25 LoopManager + 22 gateway). 13 CLI tests have pre-existing
pytest-xdist/prompt_toolkit stdout capture issue — pass individually.

Kanban: 8-phase pipeline complete (investigate-A/B/C, plan, verify-plan, implement, verify-impl, final-verify)
Changes:
- hermes_cli/loop.py: Add MIN_INTERVAL_SECONDS=60, clamp interval, remove [Loop check] prefix and on_message popup
- cli.py: Remove [Loop check] from kickoff enqueue
- gateway/run.py: Replace no-op _post_turn_loop_continuation with real timer logic, add daemon ticker thread
- tests: Updated test expectations for new behaviour

Known gap: _dispatch_loop_prompt needed for TUI path — see docs/plans/loop-fix-plan.md for Tasks 1-3
… TUI gateway

- cli.py: skip LoopScheduler spawn + immediate kickoff when
  HERMES_SLASH_WORKER or HERMES_SESSION_KEY env is set (slash_worker
  is ephemeral — any thread it spawns dies with the process)
- tui_gateway/server.py: add _ensure_loop_ticker() — per-session
  daemon thread that re-reads loop state from SessionDB each second,
  fires immediately on first tick, and handles session-key rotation
  from compression; wired after agent build and after each turn
…ch callback

LoopScheduler now takes a single 'dispatch(prompt) -> bool' callback
instead of queue/is_idle/on_message plumbing. Both CLI and TUI paths
use the SAME class — the only difference is what dispatch does:

  CLI: dispatch puts prompt into _pending_input queue
  TUI: dispatch calls _run_prompt_submit() to inject
  Slash_worker: LoopManager(dispatch=None) — persist only, no scheduler

Changes:
- hermes_cli/loop.py: refactor LoopScheduler (+del _state, +dispatch),
  add dispatch param to LoopManager, auto-start on set/resume,
  add delete() and _del_loop_meta() + delete_loop() convenience helper
- cli.py: collapse 3x _is_slash_worker checks into 1 (in
  _get_loop_manager), add _make_loop_dispatch(), simplify /loop handlers
- tui_gateway/server.py: delete _ensure_loop_ticker() (~65 lines),
  replace with _make_tui_dispatch() + LoopManager in session dict,
  remove post-turn re-ensure (LoopManager handles own lifecycle)

Net: -45 lines, zero duplicated tick logic, clean delete API.
Two import-time crashes fixed:
- cli.py: add Callable to typing imports (needed by _make_loop_dispatch
  returning Callable[[str], bool])
- tui_gateway/server.py: add Callable to typing imports (needed by
  _make_tui_dispatch signature), fix key variable shadowed in _build()
  closure — assigning 'key' inside _build() made Python treat all
  earlier key reads as local, causing UnboundLocalError
The TUI gateway creates LoopManager before the slash_worker persists
the loop to SessionDB.  Previously the scheduler only started when a
loop already existed, so the gateway's ticker slept forever.

Two changes:
1. LoopManager.__init__: always start scheduler when dispatch is set,
   not only when a persisted active loop exists
2. LoopScheduler._tick: when load_loop() returns None, just keep
   polling instead of setting _running=False — the slash_worker may
   persist the loop on the very next tick
…ew loop

Each /loop [interval] <prompt> creates a new loop with a 6-char UUID.
No names, no overwriting — pure UIDs like #a3f1c2.  List, pause,
resume, and clear target loops by UID.

Storage: loop:<sid>:<uid> keys + loop:<sid>:__ids__ registry.
All name-based logic removed — parser, manager, and both handlers
(CLI + TUI gateway) now use uid-based interface.
Each active loop now shows time remaining until next dispatch.
Examples: next: now, next: 42s, next: 3m 15s, next: 1h 23m.
Computed from last_fired_at + interval_seconds vs time.time().
Bug 1 — Double fire on /loop create:
The TUI handler returned {type: send, message: prompt} which
immediately submitted the prompt as a user message. The scheduler
then also fired on next tick (last_fired_at=0). Changed to
{type: exec, output: notice} so only the notice shows; scheduler
handles the actual first dispatch.

Bug 2 — Countdown always shows 'now':
LoopManager._states was never refreshed after scheduler wrote
last_fired_at to SessionDB. status_line() now calls load_all_loops()
to pull fresh state before rendering countdown.
- plugins/memory/__init__.py: restore get_active_memory_providers()
  supporting memory.providers list (multi) and memory.provider (legacy)
- agent/memory_manager.py: restore multi-provider add_provider(),
  remove_provider(), thread safety, schema-before-mutation safety
- agent/agent_init.py: loop over all active providers instead of
  loading a single one
- tests: update monkeypatches for renamed function + list returns
…n, process_registry — fork sync May 2026
Extends the previous commit to cover the remaining additive-column index
that sits on the same migration trap:

- ``task_events.run_id`` -> ``idx_events_run`` was still in SCHEMA_SQL.
  A legacy ``task_events`` table predating NousResearch#17805 (no ``run_id``) would
  still abort ``executescript`` before ``_migrate_add_optional_columns``
  could add the column. Hoisted out of SCHEMA_SQL and made unconditional
  in the migration alongside the other three indexes.

- Removed the now-redundant ``CREATE INDEX idx_tasks_idempotency`` that
  was nested inside the ``if "idempotency_key" not in cols`` branch.
  The unconditional create lower in the function makes it idempotent
  on both fresh and legacy DBs.

- Strengthened the regression test to cover all four indexes
  (``idx_tasks_session_id``, ``idx_tasks_tenant``, ``idx_tasks_idempotency``,
  ``idx_events_run``) and to seed a pre-NousResearch#17805 ``task_events`` shape that
  exercises the ``run_id`` migration path.

The result: every ``CREATE INDEX`` that depends on an additive column now
runs after the migration ensures the column exists. Verified against a
realistic pre-NousResearch#16081 board fixture (tasks + task_events both legacy
shape) — origin/main reproduces ``no such column: session_id``; this
branch migrates cleanly and creates all four indexes.
…am merge

The merge of upstream/main (9faa3cc89) brought in the call site
(agent_init.py) and tests but missed the ContextCompressor.__init__
signature change, causing ~528 test failures with:

  TypeError: ContextCompressor.__init__() got an unexpected
  keyword argument 'abort_on_summary_failure'

Adds the missing parameter and the abort-on-failure logic block
in compress() to match upstream at 9aae59f.
Four categories of fixes for fork additions clobbered by upstream merge:

1. gateway/run.py — 3-way surgical insert of loop dispatch methods
   (_dispatch_loop_prompt, _is_loop_continuation_event,
   _clear_loop_pending_continuations, _loop_still_active_for_session,
   _get_loop_manager_for_event, _send_loop_status_notice,
   _defer_loop_status_notice_after_delivery,
   _post_turn_loop_continuation) into upstream base preserving all
   upstream additions (noise filter, reconnect, bundles, platform
   commands, probe_audio, telegram topic mode)

2. tools/file_operations.py — _check_git_baseline() method +
   warning wiring in ShellFileOperations.write_file

3. hermes_cli/loop.py — save_loop/load_loop public aliases

4. agent/context_compressor.py — abort_on_summary_failure param
   + _last_compress_aborted logic (cherry-pick c22770b2e)

All fixes verified with Python syntax check + attribute presence.
…add missing blank lines between loop methods
The /loop command was registered under 'Tools & Skills' category but the
test test_loop_command_category expects it in 'Session' (alongside /background,
/queue, /steer, etc.). Moving to Session is semantically correct since /loop
is a per-session recurring prompt runner, similar to /background and /queue.
…ler guard; run_agent.py config delegation to agent_init

- _load_loop: assign state.id=uid when deserialized id is None (old JSON compat)
- LoopManager.set(): skip fork-pattern scheduler when dispatch scheduler active
- run_agent.py: remove 447 lines of compression/skills/tool-use config now in agent_init.py
- All 31 loop_manager tests + 23 run_agent regression tests pass
- Remove pre-drain mark_resume_pending block (NousResearch#27856 follow-up):
  The upstream pre-drain block pre-marked ALL running sessions before
  drain, then tried to clear on clean drain. This caused mark_resume_pending
  to fire on clean drains (violating test contract) and to mark sessions
  that finished during the drain window (stray resume notes).

  SIGKILL safety is preserved: if killed mid-drain, .clean_shutdown is
  not written, so suspend_recently_active() on next boot marks sessions
  as resume_pending with reason='restart_interrupted' which is in
  _AUTO_RESUME_REASONS — functionally identical auto-resume.

- Verified post-timeout block only marks sessions still in
  _running_agents (not drain-start snapshot), skips _AGENT_PENDING_SENTINEL,
  uses correct reason strings — all matching test expectations.

- Verified skipif fix: bool(os.getenv('CI')) correctly returns Python
  bool instead of string, avoiding pytest eval('true') NameError.

- All 72 tests pass (67 restart-resume + 5 memory-perf).
The test_alias_resolves_to_canonical test expects codex_runtime to resolve
to codex-runtime (for Telegram platform underscore-only naming), but the
CommandDef had no underscore alias. This matches the existing pattern set
by reload-mcp → reload_mcp.

All 1679 hermes_cli tests pass (after fixed 1 previously-existing failure).
Pre-existing failures in test_gateway_wsl/test_doctor are env/platform issues
not introduced by worker changes.
…edrock extra, kanban skills/notify, gateway stdin capture

- hermes_cli/doctor.py: move codex CLI hint into the Codex auth block with proper gating
  (index under provider section, suppress when codex already installed)
- hermes_cli/commands.py [already committed]: quit CommandDef gets args_hint='[--delete]'
- gateway/config.py: add Google Chat to _PLATFORM_CONNECTED_CHECKERS
- pyproject.toml: exclude bedrock extras from [all] (boto3 is heavy)
- hermes_cli/kanban.py: remove CLI auto-subscribe in _cmd_create to avoid double-sub with gateway handler
- tests/hermes_cli/test_kanban_core_functionality.py: kanban_home fixture creates kanban-worker skill dir
- tests/hermes_cli/test_gateway_service.py: mock prompt_yes_no + fix systemd_install lambda + stub systemd_start
- tests/hermes_cli/test_gateway_wsl.py: mock prompt_yes_no + stub systemd_start

Co-authored-by: inspector t_573b52d9
…andler

- _cmd_create: restore CLI auto-subscribe (removed in c0f1a2ce4),
  gated on new _gateway_kanban_in_progress module-level flag
- _handle_kanban_command: set flag before run_slash, clear in finally
- test: gateway test sets _HERMES_GATEWAY=1 to match production env

This fixes test_cli_create_auto_subscribes without re-breaking
test_gateway_create_autosubscribes_on_explicit_board.
…failure

The original broad except Exception silently swallowed all setup errors
including partial state where sys.stdout was wrapped but state['installed']
was never set to True. This caused isinstance(sys.stdout, _UpdateOutputStream)
to fail in CI when pytest-xdist worker forking triggered an import-path
mismatch mid-setup.

Fix: restore prev_stdout/prev_stderr in the except handler if the streams
were partially replaced but the install didn't complete, so the test's
isinstance check always sees consistent state regardless of where the
failure occurred within the try block.
Radical Edward and others added 22 commits May 25, 2026 01:35
…oss tests

The _cleanup_envs finalizer in the vercel sandbox test fixture was
calling env.cleanup() which writes snapshot metadata to the snapshot
store. This leaked {task-123: snap_default} across test boundaries on
Python 3.12 CI + xdist, causing test_persistent_cleanup_without_task_id
to fail with AssertionError: {'task-123': 'snap_default'} == {}.

Fix: set env._persistent = False in the finalizer so cleanup doesn't
write test-observable snapshot state. The tests themselves already call
cleanup() and verify snapshot state explicitly.

Fixes t_b8fcdc5c
…cess_notification

The pre-input drain in process_loop() called _format_process_notification(evt)
which does not exist anywhere in the codebase. The correct name is
format_process_notification from tools.process_registry. The bare except
around this block silently swallowed the NameError, causing all
completion_queue notifications to be silently dropped in the pre-input
check path. Idle and post-turn drains (which use drain_notifications()
correctly) were unaffected.
Adds zero-polling kanban→TUI notification delivery using a Unix FIFO:

- kanban_db.py: _append_event now writes a JSON line to
  ~/.hermes/tui_kanban.fifo when notification-worthy events are
  written (completed, blocked, gave_up, crashed, timed_out).
  Best-effort — silently skipped if FIFO doesn't exist.

- tui_gateway/server.py: Creates the FIFO at module level and
  starts a per-session daemon reader thread. The thread's open()
  blocks at the OS level — zero polling, zero CPU when idle.
  On data arrival, queries kanban DB for CLI-subscribed tasks,
  formats via _format_kanban_notification, emits status.update
  for TUI visibility, and injects as agent turn via _run_prompt_submit.

- FIFO cleaned up on gateway exit via atexit handler so stale
  FIFOs never block future startup.
27 tests covering the zero-polling kanban→TUI FIFO notification bridge:

Writer side (kanban_db._append_event):
- All 5 notify kinds (completed/blocked/crashed/timed_out/gave_up) write JSON to FIFO
- Non-notify kinds (created/heartbeat) do NOT write
- Missing FIFO does not raise
- Multiple events produce multiple lines

Reader side (tui_gateway.server._format_kanban_notification):
- All 5 event kinds format correctly with/without payload
- Unknown kind returns None
- Summary/reason truncation (200/160 chars)
- Dict-based event access (DB row format)

Lifecycle:
- FIFO created on module import
- atexit cleanup unlinks FIFO

End-to-end:
- Complete task → FIFO → reader receives correct JSON
- Non-notify events invisible to reader
…TUI dedup

Bug B: poll() was marking completions as consumed, which prevented the
TUI notification poller from delivering notify_on_complete notifications.

- Remove self._completion_consumed.add(session_id) from poll()
- Add mark_completion_consumed() method to ProcessRegistry
- Call mark_completion_consumed() from TUI poller after successful delivery

This restores notification delivery while preventing duplicates via
explicit TUI-side consumption tracking.
…comments

Bug A (cli.py typo) and Bug B (poll() marking consumed) were already
fixed in prior commits. This cleans up the remaining stale docstring
and comments that still referenced 'wait/poll/log' — poll() no longer
marks completions as consumed.
…r dispatch

Adds test_notification_poller_marks_consumed_after_dispatch to ensure
the TUI notification poller marks completions as consumed after
successfully dispatching them, preventing duplicate notifications.
…otifications

Add _session_to_run reverse mapping (session_key -> run_id) to APIServerAdapter.
Add push_process_event() method that looks up the run's SSE queue by session_key.
Fix gateway watcher call signature to pass event dict with session_key embedded.

Root cause: gateway's _run_process_watcher tried adapter.handle_message() for
api_server platform, but api_server is an HTTP/SSE handler, not a gateway adapter.
Notification was silently dropped with 'no routing metadata'.

Now the gateway pushes process.completed events directly to the api_server's SSE
queue when platform is api_server.
…otifications

Add _session_to_run reverse mapping and push_process_event() to APIServerAdapter.
Delete duplicate old push_process_event stub. Fix gateway watcher to pass
event dict with session_key embedded.

Root cause: gateway _run_process_watcher tried adapter.handle_message() for
api_server but api_server is HTTP/SSE, not a gateway adapter. Notification
silently dropped with 'no routing metadata' warning.
…nt=100

PRAGMA synchronous=NORMAL defers fsync in WAL mode, leaving kanban.db
vulnerable to corruption when a process is SIGKILL'd mid-transaction
or when WAL frames are partially written during concurrent access.

Two changes:
- synchronous=NORMAL → FULL: ensures every WAL frame is fsync'd before
  the write completes, preventing WAL/main-DB inconsistency
- wal_autocheckpoint=100: limits WAL growth to 100 pages between
  automatic checkpoints, bounding checkpoint I/O and reducing the
  window where a SIGKILL can leave a large WAL in a fragile state

Together these match the fix proposed in upstream PRs NousResearch#30969 / NousResearch#30973.

Fixes: NousResearch#31502
@someaka someaka requested a review from a team May 24, 2026 23:42
@someaka someaka closed this May 24, 2026
@someaka someaka deleted the fix/31502-kanban-synchronous-full branch May 24, 2026 23:51
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard labels May 24, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Superseded by #31731 (clean replacement — same author, focused 1-file diff vs stacked branch).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kanban SQLite database corruption under rapid task creation

4 participants