fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion (#3751) #3753
…ustion (#3751)

PostgreSQL hard-limits every table to 1600 total column "tombstones" (attnum slots, including previously dropped columns). An older version of migration 021 added `sourceId` to `channel_database`; migration 063 drops it. Because the PostgreSQL migration path has no completion tracking and re-runs all migrations on every startup, this ADD/DROP cycle consumed one tombstone slot per restart. Users with ≥1600 restarts on the old code hit the limit when migration 104 tries to add `channelHash`.

Fix: migration 104's PostgreSQL path now:

1. Short-circuits if `channelHash` already exists (full idempotency).
2. Queries `pg_attribute` for total attnum count (live + tombstones).
3. When the count is ≥1500, rebuilds `channel_database` from scratch:
   - Creates `channel_database_new` with the canonical schema (incl. `channelHash`).
   - Copies only the live columns from the old table (discovered dynamically via `pg_attribute`), so the query is safe regardless of which optional columns a given deployment has.
   - Resets the SERIAL sequence.
   - Drops the old table (CASCADE removes the FK from `channel_database_permissions` automatically).
   - Renames the new table + sequence into place.
   - Re-attaches the FK on `channel_database_permissions`.
4. Falls through to the normal `ADD COLUMN IF NOT EXISTS` when no tombstone issue exists.

No data loss: all channel keys and settings are copied. The root cause (no PG migration-completion tracking) is tracked separately; this fix ensures affected users can start again after upgrading.

Closes #3751

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_013T3CWviKTCWzzQ5kvst2T4
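The four-step decision flow in the commit message above can be sketched as a small pure function. This is an illustrative sketch, not the migration's actual code: `planChannelHashMigration`, `MigrationPlan`, `REBUILD_THRESHOLD`, and `COUNT_ATTNUMS_SQL` are hypothetical names, and only the 1500 threshold, the table name, and the `channelHash` column come from the commit message. The query text assumes that counting `pg_attribute` rows with `attnum > 0` covers both live and dropped columns, since PostgreSQL keeps dropped columns in the catalog with `attisdropped = true`.

```typescript
// Hypothetical names; only the threshold and identifiers in the SQL come from the commit message.
type MigrationPlan = "skip" | "rebuild" | "add-column";

// Rebuild threshold named in the commit message, below the hard 1600 limit.
const REBUILD_THRESHOLD = 1500;

// Counts every attnum slot the table has used: live columns plus dropped ones,
// which remain in pg_attribute with attisdropped = true. System columns have attnum < 0.
const COUNT_ATTNUMS_SQL = `
  SELECT count(*) AS total
  FROM pg_attribute
  WHERE attrelid = 'channel_database'::regclass
    AND attnum > 0`;

function planChannelHashMigration(
  liveColumns: string[],
  totalAttnums: number,
): MigrationPlan {
  // Step 1: nothing to do if the column already exists.
  if (liveColumns.includes("channelHash")) return "skip";
  // Steps 2 and 3: too many consumed slots, so ADD COLUMN is unsafe and the table is rebuilt.
  if (totalAttnums >= REBUILD_THRESHOLD) return "rebuild";
  // Step 4: healthy table, the normal ADD COLUMN IF NOT EXISTS path applies.
  return "add-column";
}
```

Keeping the decision separate from the database calls makes the three outcomes easy to check without a live PostgreSQL instance.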
Claude finished @Yeraze's task

PR Review: fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion
Summary

This is a well-reasoned fix for a subtle but serious PostgreSQL crash path. The root cause analysis is accurate (migration 021/063 ADD/DROP cycle consuming one attnum per restart), and the chosen fix (table rebuild) is the correct approach: there is no in-place way to reclaim tombstoned attnums in PostgreSQL. The code is generally high quality. I have a few observations below, mostly minor.

Issues / Observations

1. Sequence reset sets value to
…d atomic

Release-audit (PR #3753) flagged the tombstone-exhaustion rebuild path as a data-availability risk: it DROPs the live channel_database and RENAMEs a fresh copy into place, but the migration runner does not wrap each migration's statements in a transaction, so each DDL statement auto-commits. A crash or connection drop in the DROP→RENAME window would leave the database with no channel_database, and the migration's own idempotency check (`'channel_database'::regclass`) would then throw "relation does not exist" on the next startup, blocking recovery.

Wrap the rebuild in BEGIN/COMMIT with ROLLBACK on error. PostgreSQL DDL is transactional, so any failure now rolls back to the original table intact.

(Channel keys/PSKs were already copied before the drop, so this hardens the narrow availability window rather than fixing data loss. The path only fires at ≥1500 attnum tombstones, i.e. 1500+ restarts on the old migration 021 bug, so it's rare, but the blast radius warranted the fix.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
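The BEGIN/COMMIT/ROLLBACK wrapping described in this commit can be illustrated with a minimal sketch. Everything named here is hypothetical: `RebuildClient` is a stand-in interface (a real driver exposes a richer API), `runAtomically` and `makeRecordingClient` are illustrative helpers, and the two statements in `REBUILD_STATEMENTS` only echo the DROP and RENAME steps the commit message mentions, not the migration's actual SQL.

```typescript
// Hypothetical minimal client shape; a real PostgreSQL driver offers more than this.
interface RebuildClient {
  query(sql: string): Promise<unknown>;
}

// Runs all statements inside one transaction. Because PostgreSQL DDL is transactional,
// a failure anywhere rolls back to the state before BEGIN.
async function runAtomically(client: RebuildClient, statements: string[]): Promise<void> {
  await client.query("BEGIN");
  try {
    for (const sql of statements) {
      await client.query(sql);
    }
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err; // surface the original failure to the migration runner
  }
}

// Illustrative statements only: the DROP -> RENAME window the commit message describes.
const REBUILD_STATEMENTS = [
  "DROP TABLE channel_database CASCADE",
  "ALTER TABLE channel_database_new RENAME TO channel_database",
];

// Test double that records every statement and can fail on a chosen one,
// standing in for a crash in the middle of the rebuild.
function makeRecordingClient(failOn?: string) {
  const log: string[] = [];
  const client: RebuildClient = {
    async query(sql: string) {
      log.push(sql);
      if (sql === failOn) throw new Error(`simulated failure on: ${sql}`);
      return undefined;
    },
  };
  return { client, log };
}
```

With a failure injected on the RENAME statement, the recorded log ends in `ROLLBACK` and never contains `COMMIT`, which is the property the commit relies on to keep the original table intact.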
* chore(release): bump version to 4.12.0

Finalize 4.12.0 (from 4.12.0-rc2): package.json, package-lock.json, desktop/package.json, desktop/src-tauri/tauri.conf.json, helm/Chart.yaml. Promote CHANGELOG [Unreleased] → [4.12.0].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

* docs(4.12.0): document Automation Engine, MeshCore regions/scopes & UI additions

Release-audit of every PR since v4.11.5 surfaced documentation gaps. Brings the docs site current with the 4.12.0 feature set:

- New page docs/features/automation-engine.md (+ sidebar nav) for the node/builder-based Automation Engine (#3653 et al). v-pre-wrapped so the {{ }} token syntax renders literally without breaking the VitePress build.
- meshcore.md: new Regions/Scopes section (#3667 et al), channel unread indicators, heard-repeaters badge, per-message route line, byte counter, on-demand telemetry polling, Define Path editor, repeater DM-hidden behavior.
- automation.md: corrected stale auto-favorite eligibility table (#3786/#3774).
- maps.md: traceroute default-keyed channel routing + channel dropdowns (#3723).
- map-analysis.md: rich multi-source node popups (#3692).
- device.md: unmessagable-node DM-hidden behavior (#3760).

Verified: vitepress build passes; tokens render literally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

* fix(migration): make migration 104 PostgreSQL channel_database rebuild atomic

Release-audit (PR #3753) flagged the tombstone-exhaustion rebuild path as a data-availability risk: it DROPs the live channel_database and RENAMEs a fresh copy into place, but the migration runner does not wrap each migration's statements in a transaction, so each DDL statement auto-commits. A crash or connection drop in the DROP→RENAME window would leave the database with no channel_database, and the migration's own idempotency check (`'channel_database'::regclass`) would then throw "relation does not exist" on the next startup, blocking recovery.

Wrap the rebuild in BEGIN/COMMIT with ROLLBACK on error. PostgreSQL DDL is transactional, so any failure now rolls back to the original table intact.

(Channel keys/PSKs were already copied before the drop, so this hardens the narrow availability window rather than fixing data loss. The path only fires at ≥1500 attnum tombstones, i.e. 1500+ restarts on the old migration 021 bug, so it's rare, but the blast radius warranted the fix.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary

Fixes #3751: PostgreSQL crashes on startup with `tables can have at most 1600 columns` during migration 104.

Root cause

PostgreSQL hard-limits every table to 1600 total attnum slots (column "tombstones"), counting every `ADD COLUMN` ever executed, including columns that were later dropped. An older version of migration 021 included `channel_database` in its `ADD COLUMN` loop. Migration 063 drops `sourceId` from `channel_database`. Because the PostgreSQL path re-runs all migrations on every startup (no completion tracking, unlike SQLite's `settingsKey`), this created a one-tombstone-per-restart cycle: each startup re-added `sourceId` and then dropped it again, consuming one attnum slot.

The bug in migration 021 was already fixed (it now explicitly excludes `channel_database`), and no new tombstones accumulate after that fix. However, users who accumulated ≥1600 tombstones before upgrading hit this error when migration 104 tries to add `channelHash`.

Fix

Migration 104's PostgreSQL path now:

1. Short-circuits if `channelHash` is already a live column (full idempotency on repeat startups).
2. Counts the table's total attnum slots (live + tombstones) via `pg_attribute`.
3. When the count is ≥1500, rebuilds `channel_database` from scratch:
   - Creates `channel_database_new` with the canonical schema (including `channelHash`).
   - Discovers the live columns via `pg_attribute` and copies only those, which is safe regardless of which optional columns a deployment has.
   - Resets the SERIAL sequence.
   - Runs `DROP TABLE channel_database CASCADE`, which automatically removes the FK constraint on `channel_database_permissions` (data is preserved).
   - Renames the new table and sequence into place.
   - Re-attaches the FK on `channel_database_permissions`.
4. Falls through to the normal `ADD COLUMN IF NOT EXISTS` when the table is healthy.

No data loss: all channel keys and settings are copied during the rebuild.
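The "copy only the live columns" step can be sketched as a query plus a small statement builder. This is a hedged illustration rather than the migration's code: `LIVE_COLUMNS_SQL`, `quoteIdent`, and `buildCopySql` are hypothetical names, the column names in the usage note are made up, and filtering against the new table's column list is an assumption added here so that a live column missing from the canonical schema cannot break the INSERT.

```typescript
// Query text for discovering live (non-dropped) user columns of the old table.
// attnum > 0 excludes system columns; attisdropped marks tombstoned columns.
const LIVE_COLUMNS_SQL = `
  SELECT attname
  FROM pg_attribute
  WHERE attrelid = 'channel_database'::regclass
    AND attnum > 0
    AND NOT attisdropped
  ORDER BY attnum`;

// Double-quote an identifier so mixed-case names such as channelHash
// are not folded to lower case by PostgreSQL.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Builds the INSERT ... SELECT that copies rows into the rebuilt table.
// Assumption: only columns present in both tables are copied.
function buildCopySql(liveColumns: string[], newTableColumns: string[]): string {
  const shared = liveColumns.filter((c) => newTableColumns.includes(c));
  if (shared.length === 0) {
    throw new Error("no shared columns to copy");
  }
  const list = shared.map(quoteIdent).join(", ");
  return `INSERT INTO channel_database_new (${list}) SELECT ${list} FROM channel_database`;
}
```

For example, with the made-up live columns `["id", "name"]` and new-table columns `["id", "name", "channelHash"]`, `buildCopySql` returns `INSERT INTO channel_database_new ("id", "name") SELECT "id", "name" FROM channel_database`, leaving `channelHash` to take its default in the rebuilt table.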
Test plan

- `7501 passed, 3 failed`: the 3 failures are pre-existing (`mqttBrokerManager.test.ts`) and unrelated to this change

🤖 Generated with Claude Code
https://claude.ai/code/session_013T3CWviKTCWzzQ5kvst2T4