[Bug]: kanban.db corruption when multiple profile gateways share the same board DB #32424

@ParsifalC

Description

When multiple Hermes profile gateways (e.g., --profile bingge, --profile pixiel, --profile mafei) run concurrently and share the same kanban board DB (~/.hermes/kanban.db), the SQLite database becomes corrupted. The corruption specifically affects the kanban_notify_subs table indexes.

This is not a false positive from PRAGMA integrity_check — the indexes genuinely become inconsistent with the table data.

Environment

  • macOS (APFS filesystem)
  • SQLite 3.51.0
  • Hermes Agent (latest main branch)
  • 4 gateway processes: default + 3 named profiles, all sharing kanban.db by design

Root Cause Analysis

Architecture

  • kanban_home() intentionally resolves to ~/.hermes/ for all profiles (by design, per the docstring: "The kanban board is shared across profiles")
  • Each profile gateway runs its own dispatcher, which opens independent SQLite connections
  • CLI commands (hermes kanban create/complete/block/link) also open new connections

The Race

Multiple processes concurrently execute BEGIN IMMEDIATE write transactions against the same DB. SQLite WAL mode supports many concurrent readers but only one writer at a time across all connections, so the write transactions themselves should serialize. The suspected failure is that concurrent WAL checkpoints from separate processes corrupt the main DB file.
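For reference, the write pattern described above can be sketched as follows. This is a minimal illustration, not the actual code in kanban_db.py: the helper names `open_board` and `write_txn` and the 5-second busy timeout are assumptions.

```python
import sqlite3


def open_board(path, busy_timeout_ms=5000):
    """Open a connection the way each gateway/CLI process is assumed to."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN/COMMIT statements below are the only transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None)
    # Wait for a competing writer instead of failing immediately with SQLITE_BUSY.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    conn.execute("PRAGMA journal_mode = WAL")
    return conn


def write_txn(conn, statements):
    """Run (sql, params) pairs inside one BEGIN IMMEDIATE transaction."""
    # BEGIN IMMEDIATE takes the write lock up front, so only one process
    # at a time can be inside this block.
    conn.execute("BEGIN IMMEDIATE")
    try:
        for sql, params in statements:
            conn.execute(sql, params)
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

Every process that calls `write_txn` competes for the same write lock, and any of them may also trigger a checkpoint once the WAL file grows.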

Evidence

  1. 4 gateway processes had open file descriptors on kanban.db at time of corruption
  2. Last events before corruption show rapid concurrent activity from different profile dispatchers:
    • bingge gateway: completed Sprint 3 PRD (13:04:34)
    • pixiel gateway: spawned Sprint 3 design task (13:04:36)
    • mafei gateway: protocol_violation → gave_up → re-spawned Sprint 2 (13:05:04)
  3. Corruption was in kanban_notify_subs indexes (idx_notify_task + sqlite_autoindex_kanban_notify_subs_1)
  4. PRAGMA integrity_check returned:
    Tree 10 page 10: btreeInitPage() returns error code 11
    wrong # of entries in index idx_notify_task
    wrong # of entries in index sqlite_autoindex_kanban_notify_subs_1
    

Steps to Reproduce

  1. Start 3+ profile gateways: hermes --profile X gateway run --replace
  2. Run kanban operations that trigger concurrent writes (task create + claim + notify-subscribe)
  3. Observe corruption after ~30-60 minutes of active use
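A standalone stress harness may help narrow this down without running full gateways. The sketch below is an assumption-heavy stand-in: the `subs` table, its index, the profile tags, and the checkpoint cadence are all hypothetical and only loosely mirror kanban_notify_subs. It starts several writer processes against one WAL database and runs PRAGMA integrity_check at the end.

```python
import sqlite3
import subprocess
import sys

# Worker source, run in separate interpreter processes to mimic separate gateways.
WORKER = r"""
import sqlite3, sys
path, tag, n = sys.argv[1], sys.argv[2], int(sys.argv[3])
conn = sqlite3.connect(path, timeout=30, isolation_level=None)
for i in range(n):
    conn.execute("BEGIN IMMEDIATE")
    conn.execute("INSERT INTO subs(task_id, subscriber) VALUES (?, ?)",
                 (f"{tag}-{i}", tag))
    conn.execute("COMMIT")
    if i % 50 == 0:
        # Force checkpoint attempts from every process.
        conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
conn.close()
"""


def stress(path, profiles=("default", "a", "b", "c"), writes=200):
    """Run one writer process per profile, then check database integrity."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL")  # WAL mode persists in the file
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subs ("
        "task_id TEXT, subscriber TEXT, UNIQUE(task_id, subscriber))"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_subs_task ON subs(task_id)")
    conn.commit()
    conn.close()

    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER, path, tag, str(writes)])
        for tag in profiles
    ]
    exit_codes = [p.wait() for p in procs]

    conn = sqlite3.connect(path)
    report = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    count = conn.execute("SELECT COUNT(*) FROM subs").fetchone()[0]
    conn.close()
    return exit_codes, report, count
```

If this harness never reports anything other than `ok` on the affected machine, the trigger is more likely something specific to the gateway processes than plain concurrent writes.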

Suggested Fixes

Short-term

Add a file-level advisory lock (fcntl.flock) around all kanban write operations in kanban_db.py. The existing BEGIN IMMEDIATE handles SQLite-level serialization, but doesn't protect against concurrent WAL checkpoints from separate processes.
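A minimal sketch of such a lock, assuming a POSIX platform and a sidecar lock file next to the DB. The `.lock` suffix and the name `board_write_lock` are illustrative, not existing Hermes code.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def board_write_lock(db_path):
    """Hold an exclusive advisory lock for the duration of a kanban write."""
    # Lock a sidecar file rather than the database itself, so this lock stays
    # independent of the locks SQLite takes on kanban.db.
    lock_path = db_path + ".lock"  # hypothetical naming
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other process holds it
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A caller would wrap the whole write, including the commit, in `with board_write_lock(path):`. The lock is advisory, so it only helps if every writer (gateways and CLI commands alike) takes it.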

Medium-term

Serialize kanban writes through a single writer process/thread. Each gateway could send write requests to a central kanban writer instead of opening independent connections.
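The single-writer idea could look roughly like the following in-process sketch, where one thread owns the only write connection and callers hand it statements through a queue. `BoardWriter` is a hypothetical name; a cross-gateway version would need IPC (for example a local socket) in place of the in-memory queue.

```python
import queue
import sqlite3
import threading


class BoardWriter:
    """Funnels all writes through one thread that owns the connection."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, sql, params=()):
        """Queue one write and block until it is committed; returns rowcount."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((sql, params, reply))
        result = reply.get()
        if isinstance(result, Exception):
            raise result
        return result

    def close(self):
        self._requests.put(None)  # sentinel: stop the writer thread
        self._thread.join()

    def _run(self):
        # The connection is created and used only on this thread.
        conn = sqlite3.connect(self._db_path, isolation_level=None)
        try:
            while True:
                item = self._requests.get()
                if item is None:
                    break
                sql, params, reply = item
                try:
                    conn.execute("BEGIN IMMEDIATE")
                    cur = conn.execute(sql, params)
                    conn.execute("COMMIT")
                    reply.put(cur.rowcount)
                except Exception as exc:
                    if conn.in_transaction:
                        conn.execute("ROLLBACK")
                    reply.put(exc)
        finally:
            conn.close()
```

Readers can keep their own connections under this design; only writes are routed through the writer.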

Long-term

Consider PostgreSQL as an optional backend for multi-profile setups. SQLite's WAL mode allows only one writer at a time and requires every process to run on the same host, which limits how well it serves several independent writer processes.

Workaround

We are currently working around this by:

  1. Monitoring for corruption via PRAGMA integrity_check
  2. Recovering from .recover.*.sql dumps when corruption is detected
  3. Restarting all gateways after recovery

This is fragile: the recovery SQL can be stale, so recent task state is lost on restore.
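The monitoring step in the workaround can be sketched as below. The function name `board_is_healthy` is an assumption, and opening the DB read-only is a choice made here so the monitor does not add another writer.

```python
import sqlite3
from pathlib import Path


def board_is_healthy(db_path):
    """Return (ok, messages) from PRAGMA integrity_check on the board DB."""
    # Read-only URI so the monitor itself never writes to the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        # Severe damage can make the check itself fail to run.
        return False, [str(exc)]
    finally:
        conn.close()
    # A healthy database yields exactly one row containing "ok".
    return rows == ["ok"], rows
```

Running this on a timer and alerting on the first `False` shortens the window between corruption and recovery, which in turn limits how stale the recovery dump is.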

Metadata

Assignees

No one assigned

    Labels

    P2 (Medium: degraded but workaround exists)
    comp/cron (Cron scheduler and job management)
    type/bug (Something isn't working)
