Bug: KanbanDbCorruptError backup writes a new .bak on every dispatcher tick → runaway disk usage
Hermes version: v2026.5.16-881-g186bf25cb (HEAD as of 2026-05-24)
Profile: non-default profile (sophia)
Symptom: 7,861 .corrupt.*.bak files (1.7 GB) accumulated in a single per-profile kanban board directory over ~37 hours, with no built-in pruning.
What happened
A board's kanban.db became corrupt at a specific point in time:
sqlite3.OperationalError: disk I/O error
(later re-classified by the newer guard as database disk image is malformed)
After the corruption, gateway/run.py:_tick_once_for_board() calls _kb.connect(board=slug) every minute, which calls _guard_existing_db_is_healthy() (hermes_cli/kanban_db.py:1132). The guard correctly refuses to open the corrupt DB and raises KanbanDbCorruptError — but it also writes a NEW .corrupt.*.bak of the same corrupt file on every tick.
Over ~37 hours and across periods of high-retry activity (with .1.bak, .2.bak, .3.bak suffixes from sub-second retries), this accumulated 7,861 backup files at ~224 KB each = ~1.7 GB, all bit-identical (or near-identical) copies of the same corrupt source.
The directory just before cleanup:
$ ls /home/.../kanban/boards/<slug>/ | wc -l
7861 # all kanban.db.corrupt.<timestamp>.bak files
$ du -sh /home/.../kanban/boards/<slug>/
1.7G
Filename pattern with sub-second collision suffixes:
kanban.db.corrupt.20260525_190559.1.bak
kanban.db.corrupt.20260525_190559.2.bak
kanban.db.corrupt.20260525_190559.3.bak
kanban.db.corrupt.20260525_190604.bak
kanban.db.corrupt.20260525_190609.bak
... (~3.5 backups/min average, peaking higher)
Each backup is essentially a clone of the same bytes — the source corrupt DB doesn't change between dispatcher ticks (no successful writes possible), so the backups are duplicative.
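As a quick sanity check, the figures quoted above are consistent with each other. The snippet below only re-does the arithmetic from the numbers in this report (file count, approximate per-file size, approximate time window); nothing in it is measured independently.

```python
# Figures taken from this report; sizes and the time window are approximate.
backup_count = 7861
backup_size_kb = 224      # approximate size of each .bak
window_hours = 37         # approximate accumulation window

total_gb = backup_count * backup_size_kb / 1024 / 1024
rate_per_min = backup_count / (window_hours * 60)

print(f"total: about {total_gb:.1f} GB")
print(f"rate:  about {rate_per_min:.1f} backups/min")
```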
Expected behavior
When the guard detects a corrupt DB:
- Back it up once (first detection) as kanban.db.corrupt.<timestamp>.bak
- On subsequent ticks, skip the backup write if the corrupt source file's mtime/size/sha256 matches the last .bak
- Optionally: cap total .bak files in the dir to N (e.g., 5) with FIFO eviction
This bounds disk impact and stops the runaway.
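The optional cap could look roughly like the helper below. This is a minimal sketch, not Hermes code: the function name prune_corrupt_baks and the keep parameter are hypothetical, and it assumes the backup filenames follow the kanban.db.corrupt.<timestamp>.bak pattern shown above, so that sorting by name (with the .bak suffix stripped) orders them oldest first.

```python
import glob
import os


def prune_corrupt_baks(db_path, keep=5):
    """Delete all but the `keep` newest <db_path>.corrupt.*.bak files.

    Hypothetical helper. Returns the list of removed paths, oldest first.
    """
    if keep < 0:
        raise ValueError("keep must be >= 0")
    pattern = f"{glob.escape(db_path)}.corrupt.*.bak"
    # Strip ".bak" before comparing so "<ts>.bak" sorts before "<ts>.1.bak".
    baks = sorted(glob.glob(pattern), key=lambda p: p[: -len(".bak")])
    stale = baks[:-keep] if keep else baks
    for path in stale:
        os.remove(path)
    return stale
```

Ordering by filename rather than mtime is deliberate here: if the backups are written with shutil.copy2, they all inherit the corrupt source's mtime, so mtime cannot distinguish old backups from new ones.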
Suggested fix sketch (in _guard_existing_db_is_healthy)
    import os
    import shutil
    from glob import glob

    def _guard_existing_db_is_healthy(path):
        # ... existing corruption check (sets `corrupt` and `reason`) ...
        if corrupt:
            # Find most recent existing .bak (timestamped names sort chronologically)
            existing_baks = sorted(glob(f"{path}.corrupt.*.bak"))
            latest_bak = existing_baks[-1] if existing_baks else None
            # Skip if corrupt source hasn't changed since last backup
            if latest_bak and _file_signatures_match(path, latest_bak):
                raise KanbanDbCorruptError(path, latest_bak, reason)
            # Otherwise back up fresh
            new_bak = f"{path}.corrupt.{timestamp()}.bak"
            shutil.copy2(path, new_bak)
            # Optional: FIFO eviction, keeping at most MAX_CORRUPT_BAKS (>= 1) files
            for stale in (existing_baks + [new_bak])[:-MAX_CORRUPT_BAKS]:
                os.remove(stale)
            raise KanbanDbCorruptError(path, new_bak, reason)
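Where _file_signatures_match could compare (st_size, st_mtime_ns) for cheapness, or sha256 for correctness. The cheap comparison relies on the backup being written with shutil.copy2, which copies the source's mtime onto the .bak, so an unchanged source compares equal to its last backup.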
Workaround used
Used Hermes' own kanban boards delete <slug> (default = archive) to remove the corrupt board, followed by kanban boards create <slug> to recreate fresh. Worked cleanly. The destructive cleanup of the 7,861 stale .bak files happened as a side effect of the boards delete action since the entire board dir is removed.
Root cause of the original corruption
Unknown. We never identified the trigger. The first OperationalError: disk I/O error fired at 2026-05-25 14:59:13 UTC and journalctl for the surrounding 10-minute window showed no kernel events, no disk pressure, no OOM, no apt activity, no fail2ban — nothing systemic. Disk had 36 GB free throughout. Both kanban DBs passed PRAGMA integrity_check at the time on a different (non-active) path; the corruption was confined to one nested per-board DB.
The corruption may correlate with concurrent Honcho/compression refactoring that occurred ~2-3 hours earlier (visible from backup filenames in the profile's migration scripts dir), but this is circumstantial.
This bug report focuses only on the unbounded-backup behavior, not the underlying corruption cause.
Why this matters
A bot with an active kanban board can silently consume gigabytes of disk over days without operator intervention, with no notification mechanism beyond raw journal log spam. The runaway behavior turned a small 104 KB corruption into a 1.7 GB problem, and would have continued indefinitely without manual cleanup.
Reproducibility
Difficult to repro the corruption itself, but the runaway-backup behavior is trivial to repro:
- Run any Hermes profile with kanban enabled.
- Corrupt the per-board kanban.db (e.g., truncate to 64 bytes, overwrite a random middle page).
- Restart the gateway.
- Wait. Watch ls | wc -l in the board dir grow.
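The corruption step can be rehearsed on a scratch file before touching a real board. The sketch below is standalone and makes no claims about Hermes internals: the tasks table, the function names, and the health check are all hypothetical stand-ins, and the check simply treats any sqlite3.DatabaseError or any non-"ok" PRAGMA integrity_check result as unhealthy.

```python
import sqlite3


def make_scratch_db(path):
    """Create a small, valid SQLite file (the table is a hypothetical stand-in)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO tasks (title) VALUES ('demo')")
    conn.commit()
    conn.close()


def truncate_db(path, size=64):
    """Corrupt the file by cutting it off inside the 100-byte SQLite header."""
    with open(path, "r+b") as fh:
        fh.truncate(size)


def is_healthy(path):
    """Return True only if SQLite opens the file and integrity_check says ok."""
    try:
        conn = sqlite3.connect(path)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        return False
    return rows == [("ok",)]
```

Running make_scratch_db, then truncate_db, then is_healthy on a temporary path shows the transition from healthy to corrupt without risking live data; point these only at a copy, never at a board you want to keep.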
Environment
- Hermes Agent: v2026.5.16-881-g186bf25cb
- OS: Ubuntu (Hetzner VPS)
- Python: 3.x via ~/.hermes/hermes-agent/venv
- Profile setup: non-default profile (sophia) at ~/.hermes/profiles/sophia/, alongside default profile at ~/.hermes/
- Service: per-profile systemd unit (hermes-gateway-sophia.service)