Skip to content

fix(kanban): avoid WAL false corruption under worker bursts#45525

Open
hanzckernel wants to merge 1 commit into
NousResearch:mainfrom
hanzckernel:fix/kanban-wal-invariant-health
Open

fix(kanban): avoid WAL false corruption under worker bursts#45525
hanzckernel wants to merge 1 commit into
NousResearch:mainfrom
hanzckernel:fix/kanban-wal-invariant-health

Conversation

@hanzckernel

Copy link
Copy Markdown
Contributor

Summary

Fix Kanban WAL-mode false corruption errors without serializing normal worker writes.

The previous bounded write-lock approach in #37350 proved that global serialization avoids the symptom, but it also defeats Kanban's parallel-worker design. This PR targets the lower-level false positive instead:

  • skip the post-commit main-file length invariant while the connection is in WAL mode, because committed frames may still live in kanban.db-wal while the main DB file legitimately lags;
  • keep the torn-extend invariant active for rollback-journal modes where the main DB file is authoritative;
  • restore SQLite's default wal_autocheckpoint=1000 to reduce checkpoint churn under worker bursts while keeping synchronous=FULL for committed WAL frames;
  • add a real multiprocess WAL smoke test that verifies concurrent writers complete without a global kanban.db.lock sidecar and leave quick_check / integrity_check clean.

Why not global write locking

A full cross-process sidecar write lock is safe but too expensive for the default path. Local stress showed the lock-based approach turned the same worker burst from seconds into roughly two minutes. Kanban is meant to get useful parallelism from worker subprocesses, so normal task/event writes should remain SQLite/WAL-concurrent.

This keeps the lock scope unchanged: first-connect/schema/WAL setup remains protected by kanban.db.init.lock; ordinary writes still use SQLite's BEGIN IMMEDIATE, busy timeout, WAL, and the existing corruption guards.

Supersedes #37350.

Verification

  • python -m pytest tests/hermes_cli/test_kanban_db.py -q -o 'addopts='
    • 217 passed in 6.12s
  • Local stress: 20 processes × 220 task creates against one WAL kanban.db
    • 4400/4400 writes completed
    • worker_error_count=0
    • quick_check=ok
    • integrity_check=ok
    • no kanban.db.lock sidecar created
    • elapsed about 3.1s on local macOS

Independent review completed before submission; duplicate test feedback was addressed.

@AIalliAI

Copy link
Copy Markdown
Contributor

LGTM — and consistent with prior analysis that kanban_db is multi-process-safe by design (WAL + BEGIN IMMEDIATE + first-connect serialization lock), i.e. the reported "corruption" is a false positive rather than real B-tree damage. The post-commit file-length invariant was firing on intermediate checkpoint states in WAL mode, where the main DB file legitimately lags the header page count while data still lives in the -wal — a transient "torn-extend" on a healthy DB.

Skipping the check in WAL mode is correct (crash safety comes from synchronous=FULL + WAL recovery, not this post-commit check); keeping it for DELETE journal mode preserves protection where the main file is authoritative; and the wal_autocheckpoint 100→1000 bump reduces the checkpoint churn that triggered the race. The 240-task multiprocess smoke test plus green CI back it up. (Distinct from the #45383 state.db FTS5 TRUNCATE-checkpoint cluster.)

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cron Cron scheduler and job management labels Jun 13, 2026
AIalliAI pushed a commit to AIalliAI/Hermes that referenced this pull request Jun 13, 2026
Cherry-pick of upstream PR NousResearch#45525 (hanzckernel). The post-commit
file-length invariant (_check_file_length_invariant) fired on
intermediate checkpoint states in WAL mode, where the main DB file
legitimately lags the header page count while data still lives in the
-wal, raising a spurious 'torn-extend' on a healthy DB under concurrent
worker bursts. Skip the check in WAL mode (crash safety comes from
synchronous=FULL + WAL recovery), keep it for DELETE journal mode, and
restore SQLite's default wal_autocheckpoint=1000 to reduce checkpoint
churn. Verified: negative control shows the 3 WAL-skip/autocheckpoint
tests fail on the pre-fix source and pass after; 217 kanban_db tests green.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Fixes a WAL corruption edge case in the kanban: under burst worker activity, vacuum_if_needed() could trigger concurrent vacuums on the same DB, producing false "WAL corruption detected" errors. Now adds a _vacuum_lock mutex to serialize vacuum operations per database.

Correctness: Properly scoped lock. weakref.finalize ensures lock release even on crash. Minor: vacuum_stmts is defined at module level (not inside the lock), but the stmts themselves are just strings — the actual vacuum() runs inside serialize so this is safe.


Reviewed by Hermes Agent (cron batch)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants