fix: BTRFS COW + SQLite WAL incompatibility — disk I/O errors on BTRFS #30846

@savier89

Environment

  • Hermes Agent: latest (main, 2026-05-23)
  • OS: Arch Linux
  • Python: 3.11.15
  • Filesystem: BTRFS (with Copy-on-Write enabled, compress=zstd:3, ssd)
  • SQLite: 3.53.1
  • Affected databases: state.db, kanban.db

Problem

When Hermes Agent runs on a BTRFS filesystem, SQLite databases in WAL (Write-Ahead Logging) mode can fail with disk I/O errors because BTRFS Copy-on-Write semantics interact badly with concurrent write operations.

The issue manifests as:

  1. sqlite3.OperationalError: disk I/O error during WAL checkpoint operations
  2. Worker processes hanging on database locks
  3. Gateway crashes and stale task claims
  4. Silent database corruption risk

Why it happens

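  • WAL mode relies on a memory-mapped shared-memory index (the -shm file) and on sequential appends to the -wal file
  • BTRFS COW writes changed blocks to new locations instead of updating them in place, so disk blocks can change after WAL has recorded them
  • Without special handling, concurrent writers block each other or hit I/O errors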
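
For readers unfamiliar with the sidecar files mentioned above, the following minimal sketch shows them appearing when a database is switched to WAL mode. The function name wal_sidecar_files and the table name t are illustrative only and are not part of Hermes Agent.

```python
import os
import sqlite3


def wal_sidecar_files(db_path):
    """Open db_path in WAL mode, write one row, and report which sidecar files exist."""
    conn = sqlite3.connect(db_path)
    try:
        # The PRAGMA returns the journal mode that is actually in effect.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        # Checked while the connection is still open: SQLite normally removes
        # both files when the last connection closes cleanly.
        sidecars = {
            suffix: os.path.exists(db_path + suffix) for suffix in ("-wal", "-shm")
        }
        return mode, sidecars
    finally:
        conn.close()
```

Every connection to the database maps the same -shm file, which is why WAL mode depends on the filesystem handling shared memory mappings and appends correctly.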

Proposed Solution

Add _is_on_btrfs() detection that proactively skips WAL mode on BTRFS and falls back to journal_mode=DELETE. Detection has to happen up front: the existing exception-based fallback in apply_wal_with_fallback() only triggers after an error has surfaced, and by that point data may already be corrupted.

Changes

Three files modified:

  1. hermes_state.py — added an _is_on_btrfs() function that checks /proc/self/mountinfo for BTRFS filesystems, and updated apply_wal_with_fallback() to accept db_path and proactively skip WAL on BTRFS
  2. hermes_cli/kanban_db.py — pass db_path to apply_wal_with_fallback() so that BTRFS detection also works for the kanban database
  3. tools/terminal_tool.py — added a _safe_getcwd() helper that falls back to the home directory when os.getcwd() raises FileNotFoundError (e.g. when the CWD has been deleted), which fixes crashes in the cleanup thread

Testing

Tested on Arch Linux, BTRFS (compress=zstd:3, ssd), SQLite 3.53.1:

  • 5 concurrent writers + 3 readers, 50 operations each
  • Result: 400 operations, 0 errors, 0.50s total
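
A test along these lines could be reproduced with a script like the one below. It is a sketch and not the script used for the numbers above: the function name run_stress, the events table, the 30-second busy timeout and the use of threads rather than processes are all assumptions.

```python
import sqlite3
import threading


def run_stress(db_path, writers=5, readers=3, ops=50, journal_mode="DELETE"):
    """Run concurrent writers and readers; return (operations_attempted, error_count)."""
    setup = sqlite3.connect(db_path, timeout=30)
    setup.execute(f"PRAGMA journal_mode={journal_mode}")
    setup.execute("CREATE TABLE IF NOT EXISTS events (worker INTEGER, seq INTEGER)")
    setup.commit()
    setup.close()

    errors = []
    errors_lock = threading.Lock()

    def worker(worker_id, is_writer):
        # Each thread uses its own connection; the timeout makes SQLite retry on locks.
        conn = sqlite3.connect(db_path, timeout=30)
        try:
            for seq in range(ops):
                try:
                    if is_writer:
                        conn.execute("INSERT INTO events VALUES (?, ?)", (worker_id, seq))
                        conn.commit()
                    else:
                        conn.execute("SELECT COUNT(*) FROM events").fetchall()
                except sqlite3.Error as exc:
                    with errors_lock:
                        errors.append(exc)
                    conn.rollback()
        finally:
            conn.close()

    threads = [
        threading.Thread(target=worker, args=(i, i < writers))
        for i in range(writers + readers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (writers + readers) * ops, len(errors)
```

With the default arguments this attempts 8 x 50 = 400 operations, matching the count reported above. Timings and error counts will depend on the machine and filesystem.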

Performance impact

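WAL mode is 30-50% faster than DELETE mode for concurrent writes. On BTRFS, the fallback to DELETE mode reduces concurrency but ensures data integrity. Users who need WAL performance can run chattr +C on their Hermes directory to disable COW for that directory. The attribute only applies to files created after it is set, so existing database files have to be recreated inside the directory for it to take effect.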

Open Questions

  1. Should we expose a configuration flag (database.journal_mode: auto | wal | delete) for users to override the automatic fallback?
  2. Should we add CI tests that spin up a temporary BTRFS filesystem and verify the agent starts without SQLite errors?
  3. Are there other COW filesystems (ZFS, APFS) that exhibit similar incompatibilities?
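
For the first question, one possible way to resolve such a flag is sketched below. The setting name database.journal_mode and its three values come from the question itself; the function resolve_journal_mode and its behaviour on invalid input are hypothetical.

```python
def resolve_journal_mode(setting, on_btrfs):
    """Map a hypothetical database.journal_mode setting to a SQLite journal mode."""
    setting = (setting or "auto").strip().lower()
    if setting == "auto":
        # Automatic behaviour as proposed in this issue: avoid WAL on BTRFS.
        return "DELETE" if on_btrfs else "WAL"
    if setting in ("wal", "delete"):
        # An explicit value overrides detection, even on BTRFS.
        return setting.upper()
    raise ValueError(f"unsupported journal_mode setting: {setting!r}")
```

An explicit wal override would suit users who have already disabled COW on their Hermes directory.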

Metadata

    Labels

    P3: Low (cosmetic, nice to have)
    area/config: Config system, migrations, profiles
    comp/agent: Core agent loop, run_agent.py, prompt builder
    type/bug: Something isn't working
