fix: BTRFS COW + SQLite WAL incompatibility — disk I/O errors on BTRFS

## Environment

- **Hermes Agent:** latest (main, 2026-05-23)
- **OS:** Arch Linux
- **Python:** 3.11.15
- **Filesystem:** BTRFS (with Copy-on-Write enabled, compress=zstd:3, ssd)
- **SQLite:** 3.53.1
- **Affected databases:** `state.db`, `kanban.db`

## Problem

When Hermes Agent runs on a BTRFS filesystem, SQLite databases operating in WAL (Write-Ahead Logging) mode can experience `disk I/O error` failures due to BTRFS Copy-on-Write semantics interacting with concurrent write operations.

The issue manifests as:
1. `sqlite3.OperationalError: disk I/O error` during WAL checkpoint operations
2. Worker processes hanging on database locks
3. Gateway crashes and stale task claims
4. Silent database corruption risk

### Why it happens

- WAL mode relies on a shared-memory index (the `-shm` file) and sequential appends to the `-wal` file
- BTRFS COW writes modified blocks to new locations rather than updating them in place, which appears to interact badly with WAL's write pattern
- Without proper handling, concurrent writers block each other or hit I/O errors

## Proposed Solution

Add an `_is_on_btrfs()` check that proactively skips WAL mode on BTRFS and falls back to `journal_mode=DELETE`. The existing exception-based fallback in `apply_wal_with_fallback()` is not enough on its own: by the time it catches an error, the data may already be corrupted. Skipping WAL up front avoids that silent-corruption window.

### Changes

Three files modified:

1. **`hermes_state.py`** — added `_is_on_btrfs()` function that checks `/proc/self/mountinfo` for BTRFS filesystems, and updated `apply_wal_with_fallback()` to accept `db_path` and proactively skip WAL on BTRFS
2. **`hermes_cli/kanban_db.py`** — pass `db_path` to `apply_wal_with_fallback()` so BTRFS detection works for kanban database
3. **`tools/terminal_tool.py`** — added `_safe_getcwd()` helper that falls back to home directory when `os.getcwd()` raises `FileNotFoundError` (e.g. when CWD was deleted). Fixes cleanup thread crashes
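The `_safe_getcwd()` behaviour described in item 3 can be sketched as follows. This is an illustrative version, not the patch's code; the `fallback` parameter and the `demo_deleted_cwd()` helper are assumptions added for demonstration.

```python
import os
import tempfile


def safe_getcwd(fallback=None):
    """Return the current directory, or a fallback if it no longer exists."""
    try:
        return os.getcwd()
    except FileNotFoundError:
        # The CWD was removed underneath the process; use the home directory
        # unless the caller supplied another fallback.
        return fallback or os.path.expanduser("~")


def demo_deleted_cwd():
    """Reproduce the failure: chdir into a directory, delete it, then ask for the CWD."""
    original = os.getcwd()
    doomed = tempfile.mkdtemp()
    os.chdir(doomed)
    os.rmdir(doomed)
    try:
        return safe_getcwd()
    finally:
        # Restore a valid working directory for the rest of the process.
        os.chdir(original)
```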

### Testing

Tested on Arch Linux, BTRFS (compress=zstd:3, ssd), SQLite 3.53.1:
- 5 concurrent writers + 3 readers, 50 operations each
- **Result:** 400 operations, 0 errors, 0.50s total
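A stress test of this shape could be written roughly as below. This is a sketch under assumptions, not the script used for the reported numbers: the table name, the use of threads rather than processes, and the 30-second busy timeout are all illustrative choices.

```python
import os
import sqlite3
import tempfile
import threading


def run_stress(db_path, writers=5, readers=3, ops_each=50, journal_mode="DELETE"):
    """Run concurrent readers and writers; return (successful_ops, errors)."""
    setup = sqlite3.connect(db_path)
    setup.execute(f"PRAGMA journal_mode={journal_mode}")
    setup.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, worker TEXT)")
    setup.commit()
    setup.close()

    lock = threading.Lock()
    counts = {"ops": 0, "errors": 0}

    def worker(name, is_writer):
        # One connection per thread; autocommit so each INSERT is its own transaction.
        conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
        try:
            for _ in range(ops_each):
                try:
                    if is_writer:
                        conn.execute("INSERT INTO events (worker) VALUES (?)", (name,))
                    else:
                        conn.execute("SELECT COUNT(*) FROM events").fetchone()
                    key = "ops"
                except sqlite3.Error:
                    key = "errors"
                with lock:
                    counts[key] += 1
        finally:
            conn.close()

    threads = [threading.Thread(target=worker, args=(f"w{i}", True)) for i in range(writers)]
    threads += [threading.Thread(target=worker, args=(f"r{i}", False)) for i in range(readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts["ops"], counts["errors"]


def run_stress_in_tempdir(**kwargs):
    """Convenience wrapper that runs the stress test against a throwaway database."""
    with tempfile.TemporaryDirectory() as tmp:
        return run_stress(os.path.join(tmp, "stress.db"), **kwargs)
```

With the defaults, (5 + 3) workers at 50 operations each gives the 400 operations quoted above. To exercise the BTRFS case, `db_path` needs to point at a file on a BTRFS mount.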

### Performance impact

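WAL mode is 30-50% faster than DELETE mode for concurrent writes. On BTRFS, the fallback to DELETE mode reduces concurrency but protects data integrity. Users who need WAL performance can run `chattr +C` on their Hermes directory to disable COW for that directory. Note that the attribute only takes effect for files created after it is set, so existing database files must be recreated (for example, by copying them into place) to pick it up.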

## Related

- [SQLite WAL documentation](https://www.sqlite.org/wal.html)
- [SQLite Performance on Btrfs](https://wiki.tnonline.net/w/Blog/SQLite_Performance_on_Btrfs) — WAL mode recommended for BTRFS with proper settings
- [FreshRSS #3853](https://github.com/FreshRSS/FreshRSS/issues/3853) — Similar issue resolved with WAL mode
- [Patch repository](https://github.com/savier89/hermes-btrfs-fix) — Full patch and test results

## Open Questions

1. Should we expose a configuration flag (`database.journal_mode: auto | wal | delete`) for users to override the automatic fallback?
2. Should we add CI tests that spin up a temporary BTRFS filesystem and verify the agent starts without SQLite errors?
3. Are there other COW filesystems (ZFS, APFS) that exhibit similar incompatibilities?
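For the first question, one possible way an `auto | wal | delete` setting could be resolved is sketched below. The function name `resolve_journal_mode` and the rule that `auto` defers to filesystem detection are assumptions for discussion, not existing Hermes behaviour.

```python
def resolve_journal_mode(setting, on_btrfs):
    """Map a hypothetical `database.journal_mode` setting to a SQLite journal mode."""
    value = (setting or "auto").strip().lower()
    if value == "auto":
        # Automatic behaviour: avoid WAL only where it is known to be risky.
        return "DELETE" if on_btrfs else "WAL"
    if value in ("wal", "delete"):
        # Explicit override: the user takes responsibility for the choice.
        return value.upper()
    raise ValueError(f"unsupported journal_mode setting: {setting!r}")
```

An explicit `wal` override would let users who have disabled COW on their Hermes directory opt back in to WAL.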

