
Kanban SQLite database corruption under rapid task creation #31502

@KorepanovSA

Bug Report: Kanban SQLite database corruption under rapid task creation

Summary

The kanban SQLite database (~/.hermes/kanban.db) becomes corrupted, with SQLite reporting "database disk image is malformed", after about 9-10 tasks are created in rapid succession through the kanban_create tool API. This has happened 3 times in 2 days during normal orchestrator workflow.

Environment

  • Hermes Agent version: v0.14.0 (2026.5.16)
  • Python: 3.11.15
  • OS: Ubuntu 22.04 (Linux 6.8.0-117-generic)
  • SQLite: bundled with Python 3.11

Steps to Reproduce

  1. Start gateway: hermes gateway start
  2. Create tasks via kanban_create tool API in a loop (or rapid succession)
  3. After ~9-10 tasks, the next kanban_create call fails with:
    {"error": "kanban_create: database disk image is malformed"}
    
  4. All subsequent kanban operations fail with the same error

Observed Behavior

  • kanban.db-wal file becomes 0 bytes after corruption
  • The database file itself appears intact in size but is unreadable by SQLite
  • Previously created tasks are lost
  • Gateway continues running but cannot dispatch tasks
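The observations above can be collected in a repeatable way with a small diagnostic. This is a minimal sketch using only Python's standard sqlite3 module; the function name diagnose is made up for illustration and is not part of Hermes. It is best run against a copy of kanban.db with the gateway stopped, since opening a WAL database can itself create or modify the -wal and -shm files.

```python
import os
import sqlite3


def diagnose(db_path):
    """Collect basic health facts about a SQLite database and its WAL sidecar files.

    Hypothetical helper for triage; not part of Hermes.
    """
    report = {}
    # File sizes are read before connecting, so they reflect the on-disk state as found.
    for suffix in ("", "-wal", "-shm"):
        path = db_path + suffix
        # None means the file does not exist.
        report["size" + suffix] = os.path.getsize(path) if os.path.exists(path) else None
    try:
        conn = sqlite3.connect(db_path)
        try:
            report["journal_mode"] = conn.execute("PRAGMA journal_mode").fetchone()[0]
            # integrity_check yields a single row "ok" for a healthy database,
            # otherwise one row per problem found.
            report["integrity"] = [row[0] for row in conn.execute("PRAGMA integrity_check")]
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # A damaged file usually surfaces here, e.g. "database disk image is malformed".
        report["error"] = str(exc)
    return report
```

Calling diagnose on the copied database path and pasting the resulting dictionary into this issue would show the journal mode, the sidecar file sizes, and the exact integrity errors in one place.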

Expected Behavior

  • Creating 10+ tasks sequentially should not corrupt the database
  • WAL mode should handle concurrent access safely
  • If corruption occurs, it should be recoverable without full re-initialization
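One way the third expectation could be met is to keep a periodic snapshot and restore it on startup when the live file fails a check. The sketch below is an assumption about how that might look, not how Hermes works today: the function names and the backup path are hypothetical, and it relies only on SQLite's quick_check pragma and the online backup API exposed as sqlite3.Connection.backup.

```python
import os
import shutil
import sqlite3
import time


def is_healthy(db_path):
    """Return True if SQLite's quick_check reports no problems."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute("PRAGMA quick_check").fetchone()[0] == "ok"
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        return False


def snapshot(db_path, backup_path):
    """Write a consistent copy of the live database using the online backup API."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def restore_if_corrupt(db_path, backup_path):
    """If db_path fails quick_check, set it aside and restore the last snapshot.

    Returns True if a restore happened, False if the database was healthy.
    """
    if is_healthy(db_path):
        return False
    if not (os.path.exists(backup_path) and is_healthy(backup_path)):
        raise RuntimeError("no healthy snapshot available")
    stamp = time.strftime("%Y%m%d_%H%M%S")
    # Keep the damaged files for later analysis instead of deleting them.
    for suffix in ("", "-wal", "-shm"):
        if os.path.exists(db_path + suffix):
            os.replace(db_path + suffix, f"{db_path}.corrupted.{stamp}{suffix}")
    shutil.copyfile(backup_path, db_path)
    return True
```

Tasks created after the last snapshot would still be lost, so this limits the damage rather than preventing it.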

Recovery Steps (currently required)

hermes gateway stop
cp ~/.hermes/kanban.db ~/.hermes/kanban.db.backup.$(date +%Y%m%d_%H%M%S)
rm -f ~/.hermes/kanban.db-shm ~/.hermes/kanban.db-wal
mv ~/.hermes/kanban.db ~/.hermes/kanban.db.corrupted.$(date +%Y%m%d_%H%M%S)
hermes kanban init
hermes gateway start

Additional Context

  • The issue occurs when using the tool API (kanban_create), not CLI commands
  • We added 1-second delays between kanban_create calls as a workaround, but this is not a fix
  • The dispatcher holds an open DB connection; concurrent writes from tool API calls may race with WAL checkpointing
  • Previous corruption incidents: 2025-05-23 (twice), 2025-05-24 (once)
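To help tell whether the suspected race lives in SQLite usage generally or in Hermes-specific code, a standalone stress test against a scratch database may be useful as a baseline. The sketch below is not the Hermes code path: the tasks table, the function name stress, and the thread counts are all assumptions. It keeps one long-lived connection open (standing in for the dispatcher), writes from several other connections, and issues passive WAL checkpoints while the writes are in flight.

```python
import sqlite3
import threading
import time


def stress(db_path, writers=4, inserts_per_writer=50):
    """Write to a WAL database from several connections while another one checkpoints."""
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, title TEXT)")
    setup.commit()
    setup.close()

    # Stand-in for the dispatcher's long-lived connection (an assumption).
    holder = sqlite3.connect(db_path, timeout=30)
    errors = []

    def writer(n):
        # Each writer uses its own connection; timeout sets SQLite's busy timeout.
        conn = sqlite3.connect(db_path, timeout=30)
        try:
            for i in range(inserts_per_writer):
                with conn:  # commits on success, rolls back on exception
                    conn.execute("INSERT INTO tasks (title) VALUES (?)", (f"w{n}-{i}",))
        except sqlite3.Error as exc:
            errors.append(str(exc))
        finally:
            conn.close()

    threads = [threading.Thread(target=writer, args=(n,)) for n in range(writers)]
    for t in threads:
        t.start()
    # Checkpoint repeatedly from the long-lived connection while writers are active.
    while any(t.is_alive() for t in threads):
        holder.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchall()
        time.sleep(0.001)
    for t in threads:
        t.join()

    rows = holder.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
    integrity = holder.execute("PRAGMA integrity_check").fetchone()[0]
    holder.close()
    return {"rows": rows, "integrity": integrity, "errors": errors}
```

If this baseline stays clean on the affected machine while kanban_create still corrupts the database, that would point toward something in how the gateway or tool API opens, shares, or removes the database files rather than toward SQLite's WAL handling itself.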

Suggested Investigation

  1. Check if the kanban DB connection uses proper transaction isolation
  2. Verify WAL checkpoint behavior under rapid writes
  3. Consider adding an application-level write queue or mutex for kanban operations
  4. Add automatic WAL recovery on startup if -wal or -shm files are stale
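Item 3 could look roughly like the following. This is a minimal sketch, assuming all in-process writes can be routed through one object; the class name SerializedWriter and the tasks table used in the usage note are hypothetical. It combines a mutex with BEGIN IMMEDIATE and a busy timeout, so writers in the same process queue on the lock and writers in other processes wait instead of failing immediately.

```python
import sqlite3
import threading


class SerializedWriter:
    """Funnels all writes through one connection guarded by a lock (hypothetical helper)."""

    def __init__(self, db_path, busy_timeout_ms=5000):
        # check_same_thread=False lets several threads share the connection;
        # the lock below is what makes that safe.
        # isolation_level=None disables implicit transactions so BEGIN/COMMIT are explicit.
        self._conn = sqlite3.connect(db_path, check_same_thread=False, isolation_level=None)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
        self._lock = threading.Lock()

    def write(self, sql, params=()):
        """Run one write statement in its own transaction and return lastrowid."""
        with self._lock:
            # BEGIN IMMEDIATE takes the write lock up front, so contention with
            # other processes shows up as a wait here rather than mid-transaction.
            self._conn.execute("BEGIN IMMEDIATE")
            try:
                cur = self._conn.execute(sql, params)
                self._conn.execute("COMMIT")
                return cur.lastrowid
            except Exception:
                if self._conn.in_transaction:
                    self._conn.execute("ROLLBACK")
                raise

    def close(self):
        with self._lock:
            self._conn.close()
```

A caller would create one SerializedWriter per process and call something like writer.write("INSERT INTO tasks (title) VALUES (?)", ("example",)) from any thread. This only serializes writers inside one process; coordination between the dispatcher and separate tool API processes would still depend on SQLite's own locking.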

Attachments

  • Will attach kanban.db and kanban.db-wal from the next corruption incident if helpful

Metadata

Labels

  • P3 (Low: cosmetic, nice to have)
  • comp/plugins (Plugin system and bundled plugins)
  • type/bug (Something isn't working)
