Problem
SQLite in WAL mode conflicts with BTRFS copy-on-write (COW), causing lock timeouts and Dashboard/Kanban failures.
Root Cause
- BTRFS COW + SQLite WAL = concurrent write conflicts
- Default busy_timeout too short (1000ms)
- No retry logic for WAL initialization
Solution (tested in hermes_state.py)
- Increased busy_timeout from 1000ms to 30000ms (30 sec)
- Added retry logic: 3 attempts, 1s delay between tries
- Proper exception handling and fallback on failure
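The three changes above can be sketched roughly as follows. This is a minimal illustration and not the actual code in hermes_state.py: the function name `open_with_wal` and the constant names are hypothetical, and the fallback shown here (keep whatever journal mode SQLite reports) is an assumption about what "fallback on failure" means.

```python
import sqlite3
import time

# Values taken from the description above; the names are hypothetical.
BUSY_TIMEOUT_MS = 30000
WAL_RETRIES = 3
WAL_RETRY_DELAY_S = 1.0


def open_with_wal(path, retries=WAL_RETRIES, delay=WAL_RETRY_DELAY_S):
    """Open a SQLite database and try to switch it to WAL mode.

    Assumption: on repeated failure the connection is still returned,
    using whatever journal mode is currently active.
    """
    # connect() takes its timeout in seconds; the PRAGMA takes milliseconds.
    conn = sqlite3.connect(path, timeout=BUSY_TIMEOUT_MS / 1000)
    conn.execute(f"PRAGMA busy_timeout = {BUSY_TIMEOUT_MS}")

    for attempt in range(1, retries + 1):
        try:
            # The PRAGMA returns the journal mode that is now in effect.
            mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
            if mode.lower() == "wal":
                return conn
        except sqlite3.OperationalError:
            # Typically "database is locked"; treated as transient here.
            pass
        if attempt < retries:
            time.sleep(delay)

    # Fallback (assumed): continue without WAL rather than fail startup.
    return conn
```

Callers can check `PRAGMA journal_mode` on the returned connection to see whether WAL was actually enabled.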
Test Results
- 400/400 concurrent operations completed successfully
- 0 errors under load
- Services stable after restart (hermes-gateway, hermes-dashboard)
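A load check of this kind could look like the sketch below. It is not the test that produced the numbers above: the split into 8 threads of 50 inserts each, the table name `t`, and the function name `run_stress` are all assumptions chosen only so that the total comes to 400 operations.

```python
import sqlite3
import threading


def run_stress(db_path, workers=8, writes_per_worker=50):
    """Run concurrent inserts against one database file.

    Returns (rows_written, error_count). The workload shape is hypothetical.
    """
    setup = sqlite3.connect(db_path, timeout=30)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE IF NOT EXISTS t (worker INTEGER, n INTEGER)")
    setup.commit()
    setup.close()

    errors = []

    def worker(worker_id):
        # Each thread uses its own connection with a 30 s busy timeout.
        conn = sqlite3.connect(db_path, timeout=30)
        try:
            for n in range(writes_per_worker):
                with conn:  # commits on success, rolls back on error
                    conn.execute("INSERT INTO t VALUES (?, ?)", (worker_id, n))
        except sqlite3.Error as exc:
            errors.append(exc)
        finally:
            conn.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    check = sqlite3.connect(db_path, timeout=30)
    count = check.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    check.close()
    return count, len(errors)
```

To be meaningful for this issue, `db_path` should point to a file on the BTRFS volume in question; a run on another filesystem only confirms that the harness itself works.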
Patch Location
See: https://github.com/savier89/hermes-btrfs-fix
Files Changed
hermes_state.py: WAL initialization with busy_timeout + retry loop
Why This Works
BTRFS COW causes temporary file locking during writes. WAL mode needs a longer timeout and retries to ride out these transient conflicts. A 30 s timeout plus 3 retries gives BTRFS enough time to complete its COW operations, so the database does not have to fall back to DELETE mode.