Bug Description
apply_wal_with_fallback() in hermes_state.py fails completely when ~/.hermes is on an APFS external SSD. Both the WAL and DELETE journal modes raise "disk I/O error". The exception from the DELETE fallback is not caught, so it propagates up and crashes every caller that depends on a SQLite connection: kanban dispatcher, SessionDB init, API server, holographic memory store, etc.
Environment
- macOS 26.5
- APFS external SSD (Thunderbolt / USB-C)
- ~/.hermes lives on the external volume
- SQLite 3.x (system default)
Root Cause
The fix from #22032 added apply_wal_with_fallback() with _WAL_INCOMPAT_MARKERS including "disk i/o error". WAL failures matching these correctly trigger a DELETE fallback on line 160. However, when DELETE also fails with a disk I/O error (as seen on APFS external SSDs), that exception is NOT caught — it propagates out unhandled:
except sqlite3.OperationalError as exc:
    msg = str(exc).lower()
    if not any(marker in msg for marker in _WAL_INCOMPAT_MARKERS):
        raise
    _log_wal_fallback_once(db_label, exc)
    conn.execute("PRAGMA journal_mode=DELETE")  # <-- UNCAUGHT
    return "delete"
Impact on callers:
- SessionDB.init (hermes_state.py:354): caught by its own except, sets _last_init_error, re-raises → session DB stays None, so features like /resume, /title, and /history silently break
- kanban_db.connect() (kanban_db.py:1050): caught by its own except, closes connection, re-raises → kanban dispatcher crashes every 60s when the dashboard is open
- api_server.py (line 349): same pattern → response store unavailable
- plugins/memory/holographic/store.py (line 134): same pattern → holographic memory store fails
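The double failure can be reproduced without an external SSD. The following is a minimal sketch, not the project's code: FailingJournalConn and apply_wal_current are hypothetical names, the function body is a condensed copy of the logic quoted above with logging omitted, and the marker tuple is assumed to contain only the one marker this report confirms.

```python
import sqlite3

# Assumed value: this report only confirms that "disk i/o error" is a marker.
_WAL_INCOMPAT_MARKERS = ("disk i/o error",)


class FailingJournalConn(sqlite3.Connection):
    """Hypothetical test double: every journal_mode pragma raises."""

    def execute(self, sql, *args):
        if "journal_mode" in sql.lower():
            raise sqlite3.OperationalError("disk I/O error")
        return super().execute(sql, *args)


def apply_wal_current(conn):
    """Condensed copy of the current logic (fallback logging omitted)."""
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        return "wal"
    except sqlite3.OperationalError as exc:
        msg = str(exc).lower()
        if not any(marker in msg for marker in _WAL_INCOMPAT_MARKERS):
            raise
        # The second failure is raised here and nothing catches it.
        conn.execute("PRAGMA journal_mode=DELETE")
        return "delete"


def reproduce():
    """Return the escaped error message, or None if nothing escaped."""
    conn = sqlite3.connect(":memory:", factory=FailingJournalConn)
    try:
        apply_wal_current(conn)
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()
    return None
```

Calling reproduce() returns the message of the exception that escaped from the DELETE fallback, which is the error each caller listed above ends up re-raising.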
Workaround
Manually set journal_mode=DELETE and run VACUUM on each affected database:
sqlite3 ~/.hermes/state.db "PRAGMA journal_mode=DELETE; VACUUM;"
sqlite3 ~/.hermes/kanban/default.db "PRAGMA journal_mode=DELETE; VACUUM;"
This persists DELETE mode in the DB header, so subsequent connections start with DELETE and never trigger the WAL fallback path.
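To confirm which mode a database actually reports after the workaround, the pragma can be queried without an argument. This is a small sketch using Python's standard sqlite3 module; the helper name journal_mode is hypothetical, and on an affected volume the query itself may fail with the same I/O error.

```python
import sqlite3


def journal_mode(db_path):
    """Return the journal mode SQLite reports for the database at db_path."""
    conn = sqlite3.connect(str(db_path))
    try:
        # PRAGMA journal_mode with no argument only reads the current mode.
        return conn.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        conn.close()
```

For example, journal_mode(os.path.expanduser("~/.hermes/state.db")) should report "delete" once the workaround has been applied.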
Proposed Fix
Wrap the DELETE fallback in a try/except. If both WAL and DELETE fail, log a warning and continue with the connection's default journal mode.
def apply_wal_with_fallback(
    conn: sqlite3.Connection,
    *,
    db_label: str = "state.db",
) -> str:
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        return "wal"
    except sqlite3.OperationalError as exc:
        msg = str(exc).lower()
        if not any(marker in msg for marker in _WAL_INCOMPAT_MARKERS):
            raise
        _log_wal_fallback_once(db_label, exc)
        try:
            conn.execute("PRAGMA journal_mode=DELETE")
            return "delete"
        except sqlite3.OperationalError as delete_exc:
            logger.warning(
                "%s: both WAL and DELETE journal_mode failed "
                "(WAL: %s, DELETE: %s). "
                "Continuing with default journal mode.",
                db_label, exc, delete_exc,
            )
            # "delete" is still returned although the pragma did not take effect.
            return "delete"
Tests to update
test_captures_cause_on_failed_init in tests/test_hermes_state_wal_fallback.py currently expects SessionDB() to raise when both pragmas fail. With the fix, SessionDB would succeed (both errors caught internally). Update the test to verify:
- SessionDB() succeeds despite both journal_mode pragmas failing
- The connection is usable for reads/writes
- A warning is logged (new test or extend the existing one)
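The three checks above could look roughly like the sketch below. It is written against a condensed copy of the proposed function rather than SessionDB, whose constructor is not shown in this report. FailingJournalConn, _ListHandler, check_double_failure, and the logger name are hypothetical, and the marker tuple is assumed to hold only the marker confirmed here.

```python
import logging
import sqlite3

logger = logging.getLogger("hermes_state_sketch")  # hypothetical logger name
_WAL_INCOMPAT_MARKERS = ("disk i/o error",)  # assumed subset of the real tuple


class FailingJournalConn(sqlite3.Connection):
    """Hypothetical test double: every journal_mode pragma raises."""

    def execute(self, sql, *args):
        if "journal_mode" in sql.lower():
            raise sqlite3.OperationalError("disk I/O error")
        return super().execute(sql, *args)


def apply_wal_with_fallback(conn, *, db_label="state.db"):
    """Condensed copy of the proposed fix (one-time fallback log omitted)."""
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        return "wal"
    except sqlite3.OperationalError as exc:
        msg = str(exc).lower()
        if not any(marker in msg for marker in _WAL_INCOMPAT_MARKERS):
            raise
        try:
            conn.execute("PRAGMA journal_mode=DELETE")
            return "delete"
        except sqlite3.OperationalError as delete_exc:
            logger.warning(
                "%s: both WAL and DELETE journal_mode failed (WAL: %s, DELETE: %s).",
                db_label, exc, delete_exc,
            )
            return "delete"


class _ListHandler(logging.Handler):
    """Collects log records so the check can count warnings."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def check_double_failure():
    """Return (reported mode, rows read back, number of warnings logged)."""
    handler = _ListHandler()
    logger.addHandler(handler)
    conn = sqlite3.connect(":memory:", factory=FailingJournalConn)
    try:
        mode = apply_wal_with_fallback(conn, db_label="test.db")
        # The connection must remain usable for writes and reads.
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        rows = conn.execute("SELECT x FROM t").fetchall()
    finally:
        conn.close()
        logger.removeHandler(handler)
    warnings = [r for r in handler.records if r.levelno == logging.WARNING]
    return mode, rows, len(warnings)
```

In the real test suite the same assertions would presumably be made through SessionDB() and pytest's caplog fixture instead of a hand-rolled handler.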
All callers (would benefit from the fix without any changes)
- hermes_state.py:354 (SessionDB.init)
- hermes_cli/kanban_db.py:1050 (kanban_db.connect())
- gateway/platforms/api_server.py:349 (ResponseStore init)
- plugins/memory/holographic/store.py:134 (MemoryStore init)