Environment
- Hermes Agent: latest (main, 2026-05-23)
- OS: Arch Linux
- Python: 3.11.15
- Filesystem: BTRFS (with Copy-on-Write enabled, compress=zstd:3, ssd)
- SQLite: 3.53.1
- Affected databases: state.db, kanban.db
Problem
When Hermes Agent runs on a BTRFS filesystem, SQLite databases operating in WAL (Write-Ahead Logging) mode can fail with disk I/O error because BTRFS Copy-on-Write semantics interact badly with concurrent write operations.
The issue manifests as:
- sqlite3.OperationalError: disk I/O error during WAL checkpoint operations
- Worker processes hanging on database locks
- Gateway crashes and stale task claims
- Silent database corruption risk
Why it happens
- WAL mode relies on shared memory (-shm files) and sequential writes
- BTRFS COW operations can modify disk blocks after WAL records them
- Without proper handling, concurrent writers block each other or cause I/O errors
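For readers unfamiliar with the sidecar files mentioned above, the following is a minimal sketch that opens a throwaway database in WAL mode and lists which sidecar files exist while the connection is open. The function name wal_sidecars and the table t are made up for illustration and are not part of Hermes Agent.

```python
import os
import sqlite3
import tempfile


def wal_sidecars(db_path):
    """Open db_path in WAL mode and report which sidecar files exist."""
    conn = sqlite3.connect(db_path)
    try:
        # The PRAGMA returns the mode actually in effect, which may differ
        # from the requested one if WAL cannot be enabled.
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        # Check while the connection is still open: SQLite normally removes
        # the sidecars when the last connection closes cleanly.
        present = [s for s in ("-wal", "-shm") if os.path.exists(db_path + s)]
        return mode, present
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(wal_sidecars(os.path.join(tmp, "demo.db")))
```

The -shm file is the shared-memory index that every process using the database maps, which is why the filesystem underneath it matters for concurrent workers.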
Proposed Solution
Add _is_on_btrfs() detection that proactively skips WAL mode on BTRFS and falls back to journal_mode=DELETE. The check has to happen up front: the exception-based fallback in apply_wal_with_fallback() only reacts after an error has been raised, which is too late because the data is already corrupted by that point.
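The difference between the proactive check and the exception-based fallback can be sketched as follows. This is an illustration under assumptions, not the actual patch: the function name apply_journal_mode_sketch is hypothetical, and the detector is passed in as a callable because the real signature of _is_on_btrfs() is not shown in this report.

```python
import sqlite3


def apply_journal_mode_sketch(conn, db_path, is_on_btrfs):
    """Pick a journal mode before any write happens.

    `is_on_btrfs` is a callable taking a path and returning a bool. It stands
    in for the _is_on_btrfs() helper (assumed interface).
    """
    # Decide up front instead of waiting for an OperationalError.
    wanted = "DELETE" if is_on_btrfs(db_path) else "WAL"
    actual = conn.execute(f"PRAGMA journal_mode={wanted}").fetchone()[0].lower()
    if wanted == "WAL" and actual != "wal":
        # SQLite reports the mode actually in effect; if WAL was refused,
        # settle on DELETE explicitly.
        actual = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0].lower()
    return actual
```

Passing the detector as a parameter keeps the sketch testable on machines without BTRFS, for example by calling it with lambda path: True.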
Changes
Three files modified:
- hermes_state.py: added an _is_on_btrfs() function that checks /proc/self/mountinfo for BTRFS filesystems, and updated apply_wal_with_fallback() to accept db_path and proactively skip WAL on BTRFS.
- hermes_cli/kanban_db.py: now passes db_path to apply_wal_with_fallback() so that BTRFS detection also works for the kanban database.
- tools/terminal_tool.py: added a _safe_getcwd() helper that falls back to the home directory when os.getcwd() raises FileNotFoundError (e.g. when the CWD was deleted). This fixes cleanup thread crashes.
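The two helpers described above could look roughly like the sketch below. It is not the code from the patch: the names fs_type_for_path, is_on_btrfs_sketch and safe_getcwd_sketch are hypothetical, and the parsing assumes the /proc/self/mountinfo layout documented for Linux (mount point in the fifth field, filesystem type as the first field after the " - " separator).

```python
import os

MOUNTINFO = "/proc/self/mountinfo"


def fs_type_for_path(path, mountinfo_text):
    """Return the filesystem type of the mount containing `path`, or None."""
    target = os.path.realpath(path)
    best_len, best_type = -1, None
    for line in mountinfo_text.splitlines():
        left, sep, right = line.partition(" - ")
        left_fields, right_fields = left.split(), right.split()
        if not sep or len(left_fields) < 5 or not right_fields:
            continue  # skip malformed lines
        # Spaces in mount points are escaped as \040 in mountinfo.
        mount_point = left_fields[4].replace("\\040", " ")
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins; on ties the later line wins,
        # since later mounts are stacked on top of earlier ones.
        if (target + "/").startswith(prefix) and len(mount_point) >= best_len:
            best_len, best_type = len(mount_point), right_fields[0]
    return best_type


def is_on_btrfs_sketch(path):
    """True if `path` lives on BTRFS; False when mountinfo is unreadable."""
    try:
        with open(MOUNTINFO, encoding="utf-8", errors="replace") as fh:
            return fs_type_for_path(path, fh.read()) == "btrfs"
    except OSError:
        return False


def safe_getcwd_sketch():
    """Return the CWD, or the home directory if the CWD no longer exists."""
    try:
        return os.getcwd()
    except FileNotFoundError:
        return os.path.expanduser("~")
```

Keeping the parser separate from the file read means it can be exercised with canned mountinfo text. Returning False when mountinfo cannot be read is a design choice of this sketch: it keeps the previous WAL behaviour on platforms without /proc.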
Testing
Tested on Arch Linux, BTRFS (compress=zstd:3, ssd), SQLite 3.53.1:
- 5 concurrent writers + 3 readers, 50 operations each
- Result: 400 operations, 0 errors, 0.50s total
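The test harness itself is not included in this report. A comparable stress test, assuming one thread per worker and a made-up events table, might be sketched like this. Timings will differ from the figure above depending on hardware and filesystem.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def _worker(db_path, role, ops):
    """Run `ops` reads or writes on a private connection; count outcomes."""
    done = errors = 0
    # isolation_level=None puts the connection in autocommit mode, so each
    # INSERT is its own transaction; timeout=30 sets the busy timeout.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        for i in range(ops):
            try:
                if role == "writer":
                    conn.execute(
                        "INSERT INTO events (payload) VALUES (?)", (f"op-{i}",)
                    )
                else:
                    conn.execute("SELECT COUNT(*) FROM events").fetchone()
                done += 1
            except sqlite3.OperationalError:
                errors += 1
    finally:
        conn.close()
    return done, errors


def run_stress(db_path, writers=5, readers=3, ops=50):
    """Return (completed_operations, errors) for a DELETE-mode database."""
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=DELETE")
    setup.execute(
        "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    setup.close()
    roles = ["writer"] * writers + ["reader"] * readers
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        results = list(pool.map(lambda r: _worker(db_path, r, ops), roles))
    return sum(d for d, _ in results), sum(e for _, e in results)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_stress(os.path.join(tmp, "stress.db")))
```

With the default arguments this attempts (5 + 3) x 50 = 400 operations, matching the count reported above.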
Performance impact
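WAL mode is 30-50% faster than DELETE mode for concurrent writes. On BTRFS, the fallback to DELETE mode reduces concurrency but ensures data integrity. Users who need WAL performance can use chattr +C on their Hermes directory to disable COW per-directory. Note that +C only applies to files created after the flag is set, so existing database files have to be recreated inside the flagged directory to benefit.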
Related
Open Questions
- Should we expose a configuration flag (database.journal_mode: auto | wal | delete) for users to override the automatic fallback?
- Should we add CI tests that spin up a temporary BTRFS filesystem and verify the agent starts without SQLite errors?
- Are there other COW filesystems (ZFS, APFS) that exhibit similar incompatibilities?
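To make the first question concrete, here is one way the proposed database.journal_mode setting could be resolved, should it be adopted. Nothing here exists in Hermes Agent today: the function name resolve_journal_mode and the exact semantics of each value are assumptions for discussion.

```python
def resolve_journal_mode(setting, on_btrfs):
    """Map a hypothetical database.journal_mode value to a SQLite mode.

    "wal" and "delete" are explicit overrides; "auto" keeps the proposed
    behaviour of avoiding WAL on BTRFS.
    """
    value = setting.strip().lower()
    if value in ("wal", "delete"):
        return value
    if value == "auto":
        return "delete" if on_btrfs else "wal"
    raise ValueError(f"unsupported journal_mode setting: {setting!r}")
```

Under this reading, a user who has disabled COW with chattr +C could set the value to wal to opt back in, while auto would remain the default.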