fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion (#3751) by Yeraze · Pull Request #3753 · Yeraze/meshmonitor

Yeraze · 2026-06-25T16:07:59Z

Summary

Fixes #3751 — PostgreSQL crashes on startup with tables can have at most 1600 columns during migration 104.

Root cause

PostgreSQL hard-limits every table to 1600 total attnum slots (column "tombstones"), counting every ADD COLUMN ever executed, including columns that were later dropped. An older version of migration 021 included channel_database in its ADD COLUMN loop. Migration 063 drops sourceId from channel_database. Because PostgreSQL re-runs all migrations on every startup (no completion tracking, unlike SQLite's settingsKey), this created a one-tombstone-per-restart cycle:

Startup N:     migration 021 → ADD COLUMN sourceId    (+1 tombstone)
               migration 063 → DROP COLUMN sourceId   (slot consumed)
Startup N+1:   migration 021 → ADD COLUMN sourceId    (+1 tombstone)
               ...
Startup ~1600: migration 104 → ADD COLUMN channelHash → ERROR

The bug in migration 021 was already fixed (it now explicitly excludes channel_database), and no new tombstones accumulate after that fix. However, users who accumulated ≥1600 tombstones before upgrading hit this error when migration 104 tries to add channelHash.

Fix

Migration 104's PostgreSQL path now:

Short-circuits if channelHash is already a live column (full idempotency on repeat startups).
Checks total attnum count (live + tombstones) via pg_attribute.
If count ≥ 1500 — rebuilds channel_database from scratch:
- Creates channel_database_new with the canonical schema (including channelHash).
- Discovers live columns dynamically via pg_attribute and copies only those — safe regardless of which optional columns a deployment has.
- Resets the SERIAL sequence to max(id).
- DROP TABLE channel_database CASCADE — automatically removes the FK constraint on channel_database_permissions (data is preserved).
- Renames new table and sequence into place.
- Re-attaches the FK on channel_database_permissions.
Falls through to the normal ADD COLUMN IF NOT EXISTS when the table is healthy.

No data loss — all channel keys and settings are copied during the rebuild.

Test plan

All 3 existing SQLite migration 104 tests still pass
Full Vitest suite: 7501 passed, 3 failed — the 3 failures are pre-existing (mqttBrokerManager.test.ts) unrelated to this change
Tombstone detection and rebuild logic reviewed for correctness across normal, already-migrated, and exhausted cases
CI suite

🤖 Generated with Claude Code

https://claude.ai/code/session_013T3CWviKTCWzzQ5kvst2T4

Generated by Claude Code

…ustion (#3751) PostgreSQL hard-limits every table to 1600 total column "tombstones" (attnum slots, including previously dropped columns). An older version of migration 021 added `sourceId` to `channel_database`; migration 063 drops it. Because PostgreSQL has no migration-completion tracking and re-runs all migrations on every startup, this ADD/DROP cycle consumed one tombstone slot per restart. Users with ≥1600 restarts on the old code hit the limit when migration 104 tries to add `channelHash`. Fix: migration 104's PostgreSQL path now: 1. Short-circuits if `channelHash` already exists (full idempotency). 2. Queries `pg_attribute` for total attnum count (live + tombstones). 3. When the count is ≥1500, rebuilds `channel_database` from scratch: - Creates `channel_database_new` with the canonical schema (incl. `channelHash`). - Copies only the live columns from the old table (discovered dynamically via `pg_attribute`), so the query is safe regardless of which optional columns a given deployment has. - Resets the SERIAL sequence. - Drops the old table (CASCADE removes the FK from `channel_database_permissions` automatically). - Renames the new table + sequence into place. - Re-attaches the FK on `channel_database_permissions`. 4. Falls through to the normal `ADD COLUMN IF NOT EXISTS` when no tombstone issue exists. No data loss — all channel keys and settings are copied. The root cause (no PG migration-completion tracking) is tracked separately; this fix ensures affected users can start again after upgrading. Closes #3751 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_013T3CWviKTCWzzQ5kvst2T4

github-actions · 2026-06-25T16:08:17Z

Claude finished @Yeraze's task —— View job

PR Review: fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion

Read migration implementation file
Read migration test file
Analyze code quality, bugs, performance, security
Post comprehensive review feedback

Summary

This is a well-reasoned fix for a subtle but serious PostgreSQL crash path. The root cause analysis is accurate (migration 021/063 ADD/DROP cycle consuming one attnum per restart), and the chosen fix (table rebuild) is the correct approach — there is no in-place way to reclaim tombstoned attnums in PostgreSQL. The code is generally high quality. I have a few observations below, mostly minor.

Issues / Observations

1. Sequence reset sets value to `MAX(id)` — off by one on next insert

File: 104_add_channel_database_hash.ts:116-121

SELECT setval(
  'channel_database_new_id_seq',
  COALESCE((SELECT MAX(id) FROM channel_database_new), 0)
)

setval(seq, n) sets the last-returned value to n, so the next nextval() will return n + 1. This is correct behavior and matches the intent ("advance past the highest copied id"). However, there's an edge case: if channel_database_new is empty (no rows), COALESCE returns 0, and setval(seq, 0) will make the next insert get id=1 — which is fine. No bug here, but worth a comment for the next reader since the idiom is sometimes confused with the two-argument form setval(seq, n, false).

Low priority / doc nit.

2. Tombstone threshold of 1500 is somewhat arbitrary — consider documenting the margin

File: 104_add_channel_database_hash.ts:177

if (tombstoneCount >= 1500) {

The hard PostgreSQL limit is 1600, so 1500 gives a 100-slot buffer. That's probably fine for this table (future migrations are unlikely to add 100 more columns to channel_database), but it would be helpful to add a brief inline comment explaining the chosen value, e.g.:

// Rebuild if within 100 of PostgreSQL's 1600 attnum hard limit.
if (tombstoneCount >= 1500) {

Nit.

3. `DROP TABLE channel_database CASCADE` comment is slightly misleading

File: 104_add_channel_database_hash.ts:123-125

// DROP CASCADE removes all FK constraints in referencing tables
// (rows in channel_database_permissions are preserved).
await client.query(`DROP TABLE channel_database CASCADE`);

DROP TABLE ... CASCADE drops dependent objects (FK constraints, views), but it does not drop the rows in referencing tables — rows in channel_database_permissions pointing at the now-dropped table will just have a dangling channelDatabaseId after the drop, until the FK is re-attached. The comment says "rows are preserved" — this is technically true but could confuse someone into thinking referential integrity is maintained during the window between the DROP and the FK re-attach. A more precise comment would note: "rows survive but without a FK constraint until step below."

This is a correctness gap only during the migration window, and since the whole operation runs in a single migration call it's low risk. Still worth clarifying the comment.

Minor clarity issue.

4. Re-attached FK lacks a constraint name

File: 104_add_channel_database_hash.ts:139-145

ALTER TABLE channel_database_permissions
  ADD FOREIGN KEY ("channelDatabaseId")
      REFERENCES channel_database(id)
      ON DELETE CASCADE

The original FK constraint (created in an earlier migration) likely had a name (e.g., channel_database_permissions_channelDatabaseId_fkey). After CASCADE drop and re-add, the new constraint will get an autogenerated name that may differ from the original. This is generally harmless but could cause issues if any migration, test, or admin script references the constraint by name.

Recommendation: Check the original constraint name from the schema migration that created channel_database_permissions and use ADD CONSTRAINT <name> FOREIGN KEY ... to restore it consistently.

Low-medium priority.

5. No transaction wrapping around the rebuild

File: 104_add_channel_database_hash.ts:66-147

The rebuildChannelDatabasePostgres function performs ~9 DDL/DML operations. If the migration fails partway through (e.g., network blip, OOM, crash), the database could be left in an inconsistent state: channel_database may have been dropped but channel_database_new not yet renamed. The DROP TABLE IF EXISTS channel_database_new at the top covers cleanup of the scratch table, but it doesn't cover the case where channel_database itself was already dropped.

Ideally the entire rebuild would be wrapped in a transaction (PostgreSQL supports transactional DDL). The calling migration runner would need to supply a transactional client for this, which may not be possible given the current runner architecture. At minimum, the function could document this non-atomicity as a known risk.

Medium priority — worth at least a code comment noting the caveat.

6. `canonicalCols` list does not include `channelHash` — correct, but subtly so

File: 104_add_channel_database_hash.ts:103-108

const canonicalCols = [
  'name', 'psk', 'pskLength', 'description', 'isEnabled',
  'enforceNameValidation', 'sortOrder', 'decryptedPacketCount',
  'lastDecryptedAt', 'createdBy', 'createdAt', 'updatedAt',
];
const colsToCopy = canonicalCols.filter((c) => liveCols.has(c));

channelHash is intentionally excluded because it won't exist in the old table (that's the whole point of the migration). The new table has it with a NULL default, so migrated rows simply get NULL. This is correct. A comment here would make the intent obvious — e.g., // channelHash is new; old table won't have it, new table defaults to NULL.

Nit.

Test Coverage

The 3 existing SQLite tests are solid:

Column creation ✓
Idempotency ✓
NULL default + value round-trip ✓

However, the PostgreSQL rebuild path has zero automated test coverage in this PR, which is explicitly acknowledged in the test file header. Given that this is the most complex part of the fix (table rebuild, sequence reset, FK re-attachment), it carries the highest risk. The PR notes it "requires a live PostgreSQL connection."

Recommendation: Consider adding integration tests for the PostgreSQL path using a pg test helper (similar to patterns likely already existing in the repo for other PG migrations), or at minimum document the manual verification steps performed. The tombstoneCount >= 1500 branch in particular is hard to verify without a test.

Verdict

The fix is correct and addresses a real, impactful production crash. The approach (detect tombstone saturation, rebuild table, copy data, restore FK) is sound. The issues noted above are mostly minor clarity/robustness concerns; the only medium-priority item is the lack of transactional wrapping and the constraint name consistency. The missing PostgreSQL integration tests are the most notable gap for a change this complex.

Approved with suggestions — the medium-priority items are worth addressing before merge if feasible.

…d atomic Release-audit (PR #3753) flagged the tombstone-exhaustion rebuild path as a data-availability risk: it DROPs the live channel_database and RENAMEs a fresh copy into place, but the migration runner wraps each migration's statements in no transaction, so each DDL auto-commits. A crash/connection drop in the DROP→RENAME window would leave the database with no channel_database — and the migration's own idempotency check (`'channel_database'::regclass`) would then throw "relation does not exist" on the next startup, blocking recovery. Wrap the rebuild in BEGIN/COMMIT with ROLLBACK on error. PostgreSQL DDL is transactional, so any failure now rolls back to the original table intact. (Channel keys/PSKs were already copied before the drop, so this hardens the narrow availability window rather than fixing data loss. The path only fires at ≥1500 attnum tombstones — 1500+ restarts on the old migration 021 bug — so it's rare, but the blast radius warranted the fix.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4

* chore(release): bump version to 4.12.0 Finalize 4.12.0 (from 4.12.0-rc2): package.json, package-lock.json, desktop/package.json, desktop/src-tauri/tauri.conf.json, helm/Chart.yaml. Promote CHANGELOG [Unreleased] → [4.12.0]. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4 * docs(4.12.0): document Automation Engine, MeshCore regions/scopes & UI additions Release-audit of every PR since v4.11.5 surfaced documentation gaps. Brings the docs site current with the 4.12.0 feature set: - New page docs/features/automation-engine.md (+ sidebar nav) for the node/ builder-based Automation Engine (#3653 et al). v-pre-wrapped so the {{ }} token syntax renders literally without breaking the VitePress build. - meshcore.md: new Regions/Scopes section (#3667 et al), channel unread indicators, heard-repeaters badge, per-message route line, byte counter, on-demand telemetry polling, Define Path editor, repeater DM-hidden behavior. - automation.md: corrected stale auto-favorite eligibility table (#3786/#3774). - maps.md: traceroute default-keyed channel routing + channel dropdowns (#3723). - map-analysis.md: rich multi-source node popups (#3692). - device.md: unmessagable-node DM-hidden behavior (#3760). Verified: vitepress build passes; tokens render literally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4 * fix(migration): make migration 104 PostgreSQL channel_database rebuild atomic Release-audit (PR #3753) flagged the tombstone-exhaustion rebuild path as a data-availability risk: it DROPs the live channel_database and RENAMEs a fresh copy into place, but the migration runner wraps each migration's statements in no transaction, so each DDL auto-commits. A crash/connection drop in the DROP→RENAME window would leave the database with no channel_database — and the migration's own idempotency check (`'channel_database'::regclass`) would then throw "relation does not exist" on the next startup, blocking recovery. Wrap the rebuild in BEGIN/COMMIT with ROLLBACK on error. PostgreSQL DDL is transactional, so any failure now rolls back to the original table intact. (Channel keys/PSKs were already copied before the drop, so this hardens the narrow availability window rather than fixing data loss. The path only fires at ≥1500 attnum tombstones — 1500+ restarts on the old migration 021 bug — so it's rare, but the blast radius warranted the fix.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_011JEaCGwY9Wz8jeV4e22GW4 --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Yeraze merged commit b7c1e07 into main Jun 25, 2026
19 checks passed

Yeraze deleted the claude/great-dijkstra-v95lft branch June 25, 2026 20:24

Yeraze mentioned this pull request Jun 26, 2026

release: 4.12.0 — version bump + PR audit & docs-site catch-up #3802

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion (#3751)#3753

fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion (#3751)#3753
Yeraze merged 1 commit into
mainfrom
claude/great-dijkstra-v95lft

Yeraze commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Yeraze commented Jun 25, 2026

Summary

Root cause

Fix

Test plan

Uh oh!

github-actions Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(migration): rebuild channel_database on PostgreSQL tombstone exhaustion

Summary

Issues / Observations

1. Sequence reset sets value to MAX(id) — off by one on next insert

2. Tombstone threshold of 1500 is somewhat arbitrary — consider documenting the margin

3. DROP TABLE channel_database CASCADE comment is slightly misleading

4. Re-attached FK lacks a constraint name

5. No transaction wrapping around the rebuild

6. canonicalCols list does not include channelHash — correct, but subtly so

Test Coverage

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 25, 2026 •

edited

Loading

1. Sequence reset sets value to `MAX(id)` — off by one on next insert

3. `DROP TABLE channel_database CASCADE` comment is slightly misleading

6. `canonicalCols` list does not include `channelHash` — correct, but subtly so