Kanban dispatcher writes a new .corrupt.bak on every tick after corruption (7,861 backups / 1.7 GB over 37h)

# Bug: `KanbanDbCorruptError` backup writes a new `.bak` on every dispatcher tick → runaway disk usage

**Hermes version:** `v2026.5.16-881-g186bf25cb` (HEAD as of 2026-05-24)
**Profile:** non-default profile (`sophia`)
**Symptom:** 7,861 `.corrupt.*.bak` files (1.7 GB) accumulated in a single per-profile kanban board directory over ~37 hours, with no built-in pruning.

## What happened

A board's `kanban.db` became corrupt at a specific point in time:
```
sqlite3.OperationalError: disk I/O error
```
(later re-classified by the newer guard as `database disk image is malformed`)

After the corruption, `gateway/run.py:_tick_once_for_board()` calls `_kb.connect(board=slug)` every minute, which calls `_guard_existing_db_is_healthy()` (`hermes_cli/kanban_db.py:1132`). The guard correctly refuses to open the corrupt DB and raises `KanbanDbCorruptError` — **but it also writes a NEW `.corrupt.*.bak` of the same corrupt file on every tick**.

Over ~37 hours and across periods of high-retry activity (with `.1.bak`, `.2.bak`, `.3.bak` suffixes from sub-second retries), this accumulated **7,861 backup files at ~224 KB each = ~1.7 GB**, all bit-identical (or near-identical) copies of the same corrupt source.

The directory just before cleanup:
```
$ ls /home/.../kanban/boards/<slug>/ | wc -l
7861   # all kanban.db.corrupt.<timestamp>.bak files

$ du -sh /home/.../kanban/boards/<slug>/
1.7G
```

Filename pattern with sub-second collision suffixes:
```
kanban.db.corrupt.20260525_190559.1.bak
kanban.db.corrupt.20260525_190559.2.bak
kanban.db.corrupt.20260525_190559.3.bak
kanban.db.corrupt.20260525_190604.bak
kanban.db.corrupt.20260525_190609.bak
... (~3.5 backups/min average, peaking higher)
```

Each backup is essentially a clone of the same bytes — the source corrupt DB doesn't change between dispatcher ticks (no successful writes possible), so the backups are duplicative.
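The duplication claim could be checked before cleanup by hashing every backup and grouping by digest. This is a minimal sketch, not part of Hermes: `group_backups_by_content` is a hypothetical helper name, and the glob pattern is assumed from the filenames listed above.

```python
import glob
import hashlib
import os
from collections import defaultdict


def group_backups_by_content(board_dir, pattern="kanban.db.corrupt.*.bak"):
    """Map sha256 digest -> list of backup paths sharing that content.

    Hypothetical helper; the default pattern is assumed from the
    filenames shown in this report.
    """
    groups = defaultdict(list)
    search = os.path.join(glob.escape(board_dir), pattern)
    for bak in sorted(glob.glob(search)):
        # Backups here are ~224 KB each, so reading them whole is fine.
        with open(bak, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        groups[digest].append(bak)
    return dict(groups)
```

If the backups really are clones of one unchanged corrupt file, the returned dict should contain a single key.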

## Expected behavior

When the guard detects a corrupt DB:
- Back it up **once** (first detection) as `kanban.db.corrupt.<timestamp>.bak`
- On subsequent ticks, **skip the backup write** if the corrupt source file's `mtime`/`size`/`sha256` matches the last `.bak`
- Optionally: cap total `.bak` files in the dir to N (e.g., 5) with FIFO eviction

This bounds disk impact and stops the runaway.
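The optional cap in the last bullet could look roughly like the sketch below. It is not Hermes code: `prune_corrupt_baks` and the regex are hypothetical, and the ordering assumes the filename pattern shown above, with an unsuffixed backup treated as older than `.1.bak`, `.2.bak`, and so on within the same second.

```python
import os
import re

# Assumed name shape: kanban.db.corrupt.<YYYYMMDD_HHMMSS>[.<n>].bak
_BAK_RE = re.compile(r"\.corrupt\.(\d{8}_\d{6})(?:\.(\d+))?\.bak$")


def _bak_sort_key(name):
    """Order by timestamp, then by collision suffix (none counts as 0)."""
    match = _BAK_RE.search(name)
    return (match.group(1), int(match.group(2) or 0))


def prune_corrupt_baks(board_dir, keep=5):
    """Delete all but the `keep` newest corrupt backups.

    Returns the list of removed file names, oldest first.
    """
    baks = sorted(
        (n for n in os.listdir(board_dir) if _BAK_RE.search(n)),
        key=_bak_sort_key,
    )
    stale = baks[:-keep] if keep > 0 else baks
    for name in stale:
        os.remove(os.path.join(board_dir, name))
    return stale
```

Sorting on the parsed timestamp and suffix, not on the raw name, avoids `.1.bak` sorting ahead of the unsuffixed file from the same second. Files that do not match the pattern, including `kanban.db` itself, are never touched.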

## Suggested fix sketch (in `_guard_existing_db_is_healthy`)

```python
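import glob
import os
import shutil

MAX_CORRUPT_BAKS = 5  # cap from "Expected behavior"; tune as needed

def _guard_existing_db_is_healthy(path):
    # ... existing corruption check (sets `corrupt` and `reason`) ...
    if corrupt:
        # Find most recent existing .bak (timestamped names sort chronologically)
        existing_baks = sorted(glob.glob(f"{glob.escape(str(path))}.corrupt.*.bak"))
        latest_bak = existing_baks[-1] if existing_baks else None

        # Skip if corrupt source hasn't changed since last backup
        if latest_bak and _file_signatures_match(path, latest_bak):
            raise KanbanDbCorruptError(path, latest_bak, reason)

        # Otherwise back up fresh; timestamp() stands in for the guard's
        # existing filename-timestamp logic
        new_bak = f"{path}.corrupt.{timestamp()}.bak"
        shutil.copy2(path, new_bak)

        # Optional: FIFO eviction, oldest first, until at most
        # MAX_CORRUPT_BAKS remain (new_bak included)
        for stale in (existing_baks + [new_bak])[:-MAX_CORRUPT_BAKS]:
            os.remove(stale)

        raise KanbanDbCorruptError(path, new_bak, reason)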
```

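Where `_file_signatures_match` could compare `(st_size, st_mtime_ns)` for speed, or `sha256` digests for correctness. The cheap comparison relies on `shutil.copy2` preserving the source file's modification time on the backup.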

## Workaround used

Ran Hermes' own `kanban boards delete <slug>` (default = archive) to remove the corrupt board, then `kanban boards create <slug>` to recreate it fresh. This worked cleanly. The 7,861 stale `.bak` files were removed as a side effect, since `boards delete` removes the entire board directory.

## Root cause of the original corruption

Unknown. We never identified the trigger. The first `OperationalError: disk I/O error` fired at `2026-05-25 14:59:13 UTC`, and journalctl for the surrounding 10-minute window showed no kernel events, no disk pressure, no OOM, no apt activity, no fail2ban: nothing systemic. The disk had 36 GB free throughout. Both kanban DBs on a different (non-active) path passed `PRAGMA integrity_check` at the time; the corruption was confined to one nested per-board DB.

The corruption may correlate with concurrent Honcho/compression refactoring that occurred ~2-3 hours earlier (visible from backup filenames in the profile's migration scripts dir), but this is circumstantial.

This bug report focuses **only on the unbounded-backup behavior**, not the underlying corruption cause.

## Why this matters

A bot whose active kanban board becomes corrupt can silently consume gigabytes of disk over days without operator intervention, with no notification mechanism beyond raw journal log spam. The runaway behavior turned a single small corrupt file (~224 KB) into a `1.7 GB` problem, and would have continued indefinitely without manual cleanup.

## Reproducibility

Difficult to repro the corruption itself, but the runaway-backup behavior is trivial to repro:

1. Run any Hermes profile with kanban enabled.
2. Corrupt the per-board `kanban.db` (e.g., truncate it to 64 bytes, or overwrite a random middle page).
3. Restart the gateway.
4. Wait. Watch `ls | wc -l` in the board dir grow.
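Step 2 can be scripted. The sketch below truncates a SQLite file to 64 bytes and shows that SQLite then refuses it. It is not tied to Hermes: `truncate_db`, `health_check`, and `demo` are hypothetical helpers, and the `demo` table is a placeholder schema. Only run `truncate_db` against a board you are willing to lose.

```python
import os
import sqlite3
import tempfile


def truncate_db(path, keep_bytes=64):
    """Corrupt a SQLite file by cutting it off inside its 100-byte header."""
    with open(path, "r+b") as fh:
        fh.truncate(keep_bytes)


def health_check(path):
    """Return "ok" if SQLite can verify the file, else the failure text."""
    try:
        conn = sqlite3.connect(path)
        try:
            row = conn.execute("PRAGMA integrity_check").fetchone()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        return str(exc)
    return row[0]


def demo():
    """Build a throwaway DB (placeholder schema), corrupt it, re-check it."""
    path = os.path.join(tempfile.mkdtemp(), "kanban.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO demo (title) VALUES (?)", [("task",)] * 500)
    conn.commit()
    conn.close()
    before = health_check(path)
    truncate_db(path)
    after = health_check(path)
    return before, after
```

For the real repro, stop the gateway, call `truncate_db` on the board's `kanban.db`, and then continue with step 3.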

## Environment

- Hermes Agent: `v2026.5.16-881-g186bf25cb`
- OS: Ubuntu (Hetzner VPS)
- Python: 3.x via `~/.hermes/hermes-agent/venv`
- Profile setup: non-default profile (`sophia`) at `~/.hermes/profiles/sophia/`, alongside `default` profile at `~/.hermes/`
- Service: per-profile systemd unit (`hermes-gateway-sophia.service`)

