Skip to content

fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion (#3751)#3753

Merged
Yeraze merged 1 commit into
mainfrom
claude/great-dijkstra-v95lft
Jun 25, 2026
Merged

fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion (#3751)#3753
Yeraze merged 1 commit into
mainfrom
claude/great-dijkstra-v95lft

Conversation

@Yeraze

@Yeraze Yeraze commented Jun 25, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes #3751 — PostgreSQL crashes on startup with tables can have at most 1600 columns during migration 104.

Root cause

PostgreSQL hard-limits every table to 1600 total attnum slots (column "tombstones"), counting every ADD COLUMN ever executed, including columns that were later dropped. An older version of migration 021 included channel_database in its ADD COLUMN loop. Migration 063 drops sourceId from channel_database. Because PostgreSQL re-runs all migrations on every startup (no completion tracking, unlike SQLite's settingsKey), this created a one-tombstone-per-restart cycle:

Startup N:     migration 021 → ADD COLUMN sourceId    (+1 tombstone)
               migration 063 → DROP COLUMN sourceId   (slot consumed)
Startup N+1:   migration 021 → ADD COLUMN sourceId    (+1 tombstone)
               ...
Startup ~1600: migration 104 → ADD COLUMN channelHash → ERROR

The bug in migration 021 was already fixed (it now explicitly excludes channel_database), and no new tombstones accumulate after that fix. However, users who accumulated ≥1600 tombstones before upgrading hit this error when migration 104 tries to add channelHash.

Fix

Migration 104's PostgreSQL path now:

  1. Short-circuits if channelHash is already a live column (full idempotency on repeat startups).
  2. Checks total attnum count (live + tombstones) via pg_attribute.
  3. If count ≥ 1500 — rebuilds channel_database from scratch:
    • Creates channel_database_new with the canonical schema (including channelHash).
    • Discovers live columns dynamically via pg_attribute and copies only those — safe regardless of which optional columns a deployment has.
    • Resets the SERIAL sequence to max(id).
    • DROP TABLE channel_database CASCADE — automatically removes the FK constraint on channel_database_permissions (data is preserved).
    • Renames new table and sequence into place.
    • Re-attaches the FK on channel_database_permissions.
  4. Falls through to the normal ADD COLUMN IF NOT EXISTS when the table is healthy.

No data loss — all channel keys and settings are copied during the rebuild.

Test plan

  • All 3 existing SQLite migration 104 tests still pass
  • Full Vitest suite: 7501 passed, 3 failed — the 3 failures are pre-existing (mqttBrokerManager.test.ts) unrelated to this change
  • Tombstone detection and rebuild logic reviewed for correctness across normal, already-migrated, and exhausted cases
  • CI suite

🤖 Generated with Claude Code

https://claude.ai/code/session_013T3CWviKTCWzzQ5kvst2T4


Generated by Claude Code

…ustion (#3751)

PostgreSQL hard-limits every table to 1600 total column "tombstones"
(attnum slots, including previously dropped columns).  An older version
of migration 021 added `sourceId` to `channel_database`; migration 063
drops it.  Because PostgreSQL has no migration-completion tracking and
re-runs all migrations on every startup, this ADD/DROP cycle consumed
one tombstone slot per restart.  Users with ≥1600 restarts on the old
code hit the limit when migration 104 tries to add `channelHash`.

Fix: migration 104's PostgreSQL path now:
1. Short-circuits if `channelHash` already exists (full idempotency).
2. Queries `pg_attribute` for total attnum count (live + tombstones).
3. When the count is ≥1500, rebuilds `channel_database` from scratch:
   - Creates `channel_database_new` with the canonical schema (incl.
     `channelHash`).
   - Copies only the live columns from the old table (discovered
     dynamically via `pg_attribute`), so the query is safe regardless
     of which optional columns a given deployment has.
   - Resets the SERIAL sequence.
   - Drops the old table (CASCADE removes the FK from
     `channel_database_permissions` automatically).
   - Renames the new table + sequence into place.
   - Re-attaches the FK on `channel_database_permissions`.
4. Falls through to the normal `ADD COLUMN IF NOT EXISTS` when no
   tombstone issue exists.

No data loss — all channel keys and settings are copied.  The root
cause (no PG migration-completion tracking) is tracked separately;
this fix ensures affected users can start again after upgrading.

Closes #3751

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_013T3CWviKTCWzzQ5kvst2T4
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Claude finished @Yeraze's task —— View job


PR Review: fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion

  • Read migration implementation file
  • Read migration test file
  • Analyze code quality, bugs, performance, security
  • Post comprehensive review feedback

Summary

This is a well-reasoned fix for a subtle but serious PostgreSQL crash path. The root cause analysis is accurate (migration 021/063 ADD/DROP cycle consuming one attnum per restart), and the chosen fix (table rebuild) is the correct approach — there is no in-place way to reclaim tombstoned attnums in PostgreSQL. The code is generally high quality. I have a few observations below, mostly minor.


Issues / Observations

1. Sequence reset sets value to MAX(id) — off by one on next insert

File: 104_add_channel_database_hash.ts:116-121

SELECT setval(
  'channel_database_new_id_seq',
  COALESCE((SELECT MAX(id) FROM channel_database_new), 0)
)

setval(seq, n) sets the last-returned value to n, so the next nextval() will return n + 1. This is correct behavior and matches the intent ("advance past the highest copied id"). However, there's an edge case: if channel_database_new is empty (no rows), COALESCE returns 0, and setval(seq, 0) will make the next insert get id=1 — which is fine. No bug here, but worth a comment for the next reader since the idiom is sometimes confused with the two-argument form setval(seq, n, false).

Low priority / doc nit.

2. Tombstone threshold of 1500 is somewhat arbitrary — consider documenting the margin

File: 104_add_channel_database_hash.ts:177

if (tombstoneCount >= 1500) {

The hard PostgreSQL limit is 1600, so 1500 gives a 100-slot buffer. That's probably fine for this table (future migrations are unlikely to add 100 more columns to channel_database), but it would be helpful to add a brief inline comment explaining the chosen value, e.g.:

// Rebuild if within 100 of PostgreSQL's 1600 attnum hard limit.
if (tombstoneCount >= 1500) {

Nit.

3. DROP TABLE channel_database CASCADE comment is slightly misleading

File: 104_add_channel_database_hash.ts:123-125

// DROP CASCADE removes all FK constraints in referencing tables
// (rows in channel_database_permissions are preserved).
await client.query(`DROP TABLE channel_database CASCADE`);

DROP TABLE ... CASCADE drops dependent objects (FK constraints, views), but it does not drop the rows in referencing tables — rows in channel_database_permissions pointing at the now-dropped table will just have a dangling channelDatabaseId after the drop, until the FK is re-attached. The comment says "rows are preserved" — this is technically true but could confuse someone into thinking referential integrity is maintained during the window between the DROP and the FK re-attach. A more precise comment would note: "rows survive but without a FK constraint until step below."

This is a correctness gap only during the migration window, and since the whole operation runs in a single migration call it's low risk. Still worth clarifying the comment.

Minor clarity issue.

4. Re-attached FK lacks a constraint name

File: 104_add_channel_database_hash.ts:139-145

ALTER TABLE channel_database_permissions
  ADD FOREIGN KEY ("channelDatabaseId")
      REFERENCES channel_database(id)
      ON DELETE CASCADE

The original FK constraint (created in an earlier migration) likely had a name (e.g., channel_database_permissions_channelDatabaseId_fkey). After CASCADE drop and re-add, the new constraint will get an autogenerated name that may differ from the original. This is generally harmless but could cause issues if any migration, test, or admin script references the constraint by name.

Recommendation: Check the original constraint name from the schema migration that created channel_database_permissions and use ADD CONSTRAINT <name> FOREIGN KEY ... to restore it consistently.

Low-medium priority.

5. No transaction wrapping around the rebuild

File: 104_add_channel_database_hash.ts:66-147

The rebuildChannelDatabasePostgres function performs ~9 DDL/DML operations. If the migration fails partway through (e.g., network blip, OOM, crash), the database could be left in an inconsistent state: channel_database may have been dropped but channel_database_new not yet renamed. The DROP TABLE IF EXISTS channel_database_new at the top covers cleanup of the scratch table, but it doesn't cover the case where channel_database itself was already dropped.

Ideally the entire rebuild would be wrapped in a transaction (PostgreSQL supports transactional DDL). The calling migration runner would need to supply a transactional client for this, which may not be possible given the current runner architecture. At minimum, the function could document this non-atomicity as a known risk.

Medium priority — worth at least a code comment noting the caveat.

6. canonicalCols list does not include channelHash — correct, but subtly so

File: 104_add_channel_database_hash.ts:103-108

const canonicalCols = [
  'name', 'psk', 'pskLength', 'description', 'isEnabled',
  'enforceNameValidation', 'sortOrder', 'decryptedPacketCount',
  'lastDecryptedAt', 'createdBy', 'createdAt', 'updatedAt',
];
const colsToCopy = canonicalCols.filter((c) => liveCols.has(c));

channelHash is intentionally excluded because it won't exist in the old table (that's the whole point of the migration). The new table has it with a NULL default, so migrated rows simply get NULL. This is correct. A comment here would make the intent obvious — e.g., // channelHash is new; old table won't have it, new table defaults to NULL.

Nit.


Test Coverage

The 3 existing SQLite tests are solid:

  • Column creation ✓
  • Idempotency ✓
  • NULL default + value round-trip ✓

However, the PostgreSQL rebuild path has zero automated test coverage in this PR, which is explicitly acknowledged in the test file header. Given that this is the most complex part of the fix (table rebuild, sequence reset, FK re-attachment), it carries the highest risk. The PR notes it "requires a live PostgreSQL connection."

Recommendation: Consider adding integration tests for the PostgreSQL path using a pg test helper (similar to patterns likely already existing in the repo for other PG migrations), or at minimum document the manual verification steps performed. The tombstoneCount >= 1500 branch in particular is hard to verify without a test.


Verdict

The fix is correct and addresses a real, impactful production crash. The approach (detect tombstone saturation, rebuild table, copy data, restore FK) is sound. The issues noted above are mostly minor clarity/robustness concerns; the only medium-priority item is the lack of transactional wrapping and the constraint name consistency. The missing PostgreSQL integration tests are the most notable gap for a change this complex.

Approved with suggestions — the medium-priority items are worth addressing before merge if feasible.


@Yeraze Yeraze merged commit b7c1e07 into main Jun 25, 2026
19 checks passed
@Yeraze Yeraze deleted the claude/great-dijkstra-v95lft branch June 25, 2026 20:24
Yeraze added a commit that referenced this pull request Jun 26, 2026
…d atomic

Release-audit (PR #3753) flagged the tombstone-exhaustion rebuild path as a
data-availability risk: it DROPs the live channel_database and RENAMEs a fresh
copy into place, but the migration runner wraps each migration's statements in
no transaction, so each DDL auto-commits. A crash/connection drop in the
DROP→RENAME window would leave the database with no channel_database — and the
migration's own idempotency check (`'channel_database'::regclass`) would then
throw "relation does not exist" on the next startup, blocking recovery.

Wrap the rebuild in BEGIN/COMMIT with ROLLBACK on error. PostgreSQL DDL is
transactional, so any failure now rolls back to the original table intact.
(Channel keys/PSKs were already copied before the drop, so this hardens the
narrow availability window rather than fixing data loss. The path only fires at
≥1500 attnum tombstones — 1500+ restarts on the old migration 021 bug — so it's
rare, but the blast radius warranted the fix.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4
Yeraze added a commit that referenced this pull request Jun 26, 2026
* chore(release): bump version to 4.12.0

Finalize 4.12.0 (from 4.12.0-rc2): package.json, package-lock.json,
desktop/package.json, desktop/src-tauri/tauri.conf.json, helm/Chart.yaml.
Promote CHANGELOG [Unreleased] → [4.12.0].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

* docs(4.12.0): document Automation Engine, MeshCore regions/scopes & UI additions

Release-audit of every PR since v4.11.5 surfaced documentation gaps. Brings the
docs site current with the 4.12.0 feature set:

- New page docs/features/automation-engine.md (+ sidebar nav) for the node/
  builder-based Automation Engine (#3653 et al). v-pre-wrapped so the {{ }}
  token syntax renders literally without breaking the VitePress build.
- meshcore.md: new Regions/Scopes section (#3667 et al), channel unread
  indicators, heard-repeaters badge, per-message route line, byte counter,
  on-demand telemetry polling, Define Path editor, repeater DM-hidden behavior.
- automation.md: corrected stale auto-favorite eligibility table (#3786/#3774).
- maps.md: traceroute default-keyed channel routing + channel dropdowns (#3723).
- map-analysis.md: rich multi-source node popups (#3692).
- device.md: unmessagable-node DM-hidden behavior (#3760).

Verified: vitepress build passes; tokens render literally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

* fix(migration): make migration 104 PostgreSQL channel_database rebuild atomic

Release-audit (PR #3753) flagged the tombstone-exhaustion rebuild path as a
data-availability risk: it DROPs the live channel_database and RENAMEs a fresh
copy into place, but the migration runner wraps each migration's statements in
no transaction, so each DDL auto-commits. A crash/connection drop in the
DROP→RENAME window would leave the database with no channel_database — and the
migration's own idempotency check (`'channel_database'::regclass`) would then
throw "relation does not exist" on the next startup, blocking recovery.

Wrap the rebuild in BEGIN/COMMIT with ROLLBACK on error. PostgreSQL DDL is
transactional, so any failure now rolls back to the original table intact.
(Channel keys/PSKs were already copied before the drop, so this hardens the
narrow availability window rather than fixing data loss. The path only fires at
≥1500 attnum tombstones — 1500+ restarts on the old migration 021 bug — so it's
rare, but the blast radius warranted the fix.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] [ERROR] ❌ Failed to migrate auto-responder triggers: error: tables can have at most 1600 columns

2 participants