
v0.42.36.0 fix(sync): resumable, durable, single-flight sync — converges under pool exhaustion + repeated kills (#1794) #1980

Merged: garrytan merged 10 commits into master from garrytan/sync-checkpoint-1794 on Jun 8, 2026

garrytan (Owner) commented Jun 8, 2026

What & why

A large gbrain sync that overran its launching session's timeout (SIGTERM) lost 100% of its progress and re-imported the whole backlog every run — never converging, burning CPU for hours while the source went quietly stale (#1794). The resumable checkpoint shipped in v0.42.17.0 had holes in exactly that failure mode, and #1794 recurred live. This PR closes all of them: sync is now resumable, durable, and single-flight.

How

Layer 1 — durability

  • Checkpoint reads/writes route through the direct session pool + bounded withRetry, so they survive EMAXCONNSESSION / too_many_connections (now classified retryable). Previously the write swallowed pool-exhaustion errors and banked nothing.
  • Guaranteed final flush on every exit path: cooperative timeout (async partial() flush), external SIGTERM (one-shot no-retry flush via registerCleanup, ordered before lock release), and clean completion.
  • Fail-loud: after N consecutive failed flushes the run aborts with a checkpoint_unavailable partial instead of importing unbankable work. Every partial/blocked exit logs how many files were banked.
  • First-file + time-based (~10s) flush cadence; race-safe pendingCheckpointPaths delta under parallel workers (swap + re-merge on failure).
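The swap + re-merge pattern and the fail-loud abort can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `CheckpointBuffer` class, the `append` callback, and the threshold of 3 are assumptions; only `pendingCheckpointPaths` and `checkpoint_unavailable` are names from the description above.

```typescript
// Sketch only. CheckpointBuffer, AppendFn, and the default threshold are
// hypothetical; the PR names pendingCheckpointPaths and checkpoint_unavailable.
type AppendFn = (paths: string[]) => Promise<void>;

class CheckpointBuffer {
  private pendingCheckpointPaths = new Set<string>();
  private consecutiveFailures = 0;
  private readonly append: AppendFn;
  private readonly maxConsecutiveFailures: number;
  bankedFiles = 0;

  constructor(append: AppendFn, maxConsecutiveFailures = 3) {
    this.append = append;
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  // Called by a worker each time a file finishes importing.
  markDone(path: string): void {
    this.pendingCheckpointPaths.add(path);
  }

  pendingCount(): number {
    return this.pendingCheckpointPaths.size;
  }

  async flush(): Promise<void> {
    if (this.pendingCheckpointPaths.size === 0) return;
    // Swap: take the current delta and install a fresh set before awaiting,
    // so paths reported by other workers during the write are not lost.
    const delta = this.pendingCheckpointPaths;
    this.pendingCheckpointPaths = new Set<string>();
    try {
      await this.append([...delta]);
      this.bankedFiles += delta.size;
      this.consecutiveFailures = 0;
    } catch {
      // Re-merge: the failed delta rejoins whatever accumulated meanwhile.
      for (const p of delta) this.pendingCheckpointPaths.add(p);
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.maxConsecutiveFailures) {
        // Fail loud instead of importing work that cannot be banked.
        throw new Error("checkpoint_unavailable");
      }
    }
  }
}
```

A caller would invoke `flush()` after the first file and then on a timer, and treat the thrown `checkpoint_unavailable` as the signal to stop importing and report how many files were banked.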

Layer 2 — append-only storage (op_checkpoint_paths, migration v115)

  • One row per drained path via a single writable-CTE unnest($3::text[]) write: O(delta) per flush, not the old O(N²) full-array rewrite.
  • recordCompleted keeps replace semantics for the 9 non-sync consumers; sync uses the additive appendCompleted.
  • loadOpCheckpoint unions legacy completed_keys + child rows (UNION ALL, JS-deduped).
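To make the storage change concrete, here is a hedged sketch of the three ideas in this layer: an append-style statement, the deduplicating union read, and the write-volume arithmetic behind the O(N²) claim. The parent table `op_checkpoints`, every column name, and the meaning of `$1`/`$2` are assumptions for illustration; only `op_checkpoint_paths` and `unnest($3::text[])` are taken from the description above, and the helper names are hypothetical.

```typescript
// ASSUMPTION: op_checkpoints, all column names, and the $1/$2 parameters are
// hypothetical. Only op_checkpoint_paths and unnest($3::text[]) come from the PR.
const APPEND_COMPLETED_SQL = `
  WITH parent AS (
    UPDATE op_checkpoints
       SET updated_at = now()
     WHERE op = $1 AND fingerprint = $2
    RETURNING id
  )
  INSERT INTO op_checkpoint_paths (checkpoint_id, path)
  SELECT parent.id, t.path
    FROM parent, unnest($3::text[]) AS t(path)
`;

// Union read: legacy array entries plus child rows. The SQL side uses
// UNION ALL, so duplicates can reach this function and are removed here.
function mergeCompleted(legacyKeys: string[], childPaths: string[]): string[] {
  return [...new Set([...legacyKeys, ...childPaths])];
}

// Total array elements written over a run when every flush rewrites the
// full completed list (the old behaviour), flushing `batch` files at a time.
function elementsWrittenFullRewrite(files: number, batch: number): number {
  let total = 0;
  for (let done = batch; done < files + batch; done += batch) {
    total += Math.min(done, files);
  }
  return total;
}

// With append-only rows, each path is written exactly once.
function rowsWrittenAppend(files: number): number {
  return files;
}
```

For 1,000 files flushed 100 at a time, the full-rewrite strategy writes 100 + 200 + ... + 1,000 = 5,500 elements, while the append strategy writes 1,000 rows; the gap grows quadratically with the backlog.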

Layer 3 — lock heartbeat + single-flight

  • Event-loop yield (setTimeout(0), not setImmediate — Bun starves the timers phase) keeps the lock-refresh setInterval heartbeat alive mid-import.
  • Refresh + its health probe route through the direct pool (a read-pool probe failure no longer stops renewal).
  • Heartbeat-aware takeover: a holder that refreshed within a derived grace window is not stolen even if its TTL lapsed.
  • Bare gbrain sync (no --source) now uses withRefreshingLock; a cron sync that collides with a running one is a skip, not a phase failure.
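The takeover rule and the yield mechanism above might look something like the sketch below. The `LockRow` shape, the field names, and the way the grace window is passed in are assumptions; the PR states only that a recently refreshed holder is protected, that stale or NULL heartbeats can be stolen, and that the yield uses `setTimeout(0)`.

```typescript
// Sketch only: LockRow and its field names are hypothetical.
interface LockRow {
  expiresAtMs: number;            // TTL deadline
  lastRefreshedAtMs: number | null; // last heartbeat, null if never refreshed
}

// A contender may steal the lock only if the TTL has lapsed AND the holder
// has not refreshed within the grace window.
function canTakeOver(lock: LockRow, nowMs: number, graceMs: number): boolean {
  if (nowMs < lock.expiresAtMs) return false;         // TTL still valid
  if (lock.lastRefreshedAtMs === null) return true;   // no heartbeat recorded
  return nowMs - lock.lastRefreshedAtMs > graceMs;    // stale heartbeat
}

// Yield via the timers queue so a pending setInterval heartbeat can run.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Hypothetical import loop: yields every `yieldEvery` files so a long,
// CPU-bound import does not starve the lock-refresh timer.
async function importAll(
  files: string[],
  importOne: (file: string) => void,
  yieldEvery = 50,
): Promise<void> {
  let processed = 0;
  for (const file of files) {
    importOne(file);
    processed += 1;
    if (processed % yieldEvery === 0) await yieldToEventLoop();
  }
}
```

Keeping the takeover decision a pure function of the lock row and the clock makes the grace-protection and stale/NULL-steal cases easy to unit test without a database.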

Test plan

  • New/updated: op-checkpoint.test.ts (append delta, union read, cascade clear, purge), db-lock-heartbeat-takeover.test.ts (grace protection, stale/NULL steal, direct-refresh bump, yield mechanism), retry-matcher.test.ts (EMAXCONNSESSION/53300), sync-resumable-import.serial.test.ts (still green on the append rewrite).
  • Verification gate (post-merge): 266 pass / 0 fail across op-checkpoint, resumable-sync, db-lock×3, migrate, schema-bootstrap, retry-matcher, build-llms. Typecheck clean. JSONB + batch-audit + progress-stdout guards green.
  • Note: run the suite sharded (bun run test); a monolithic bun test OOMs PGLite's WASM runtime.

Review

Plan went through /plan-eng-review + cross-model outside voice (Codex), which caught a critical bug pre-implementation: routing only the lock refresh through the direct pool was insufficient because the read-pool health probe still gated renewal (fixed). Closes #1794.

🤖 Generated with Claude Code

garrytan and others added 10 commits June 8, 2026 05:44
Durable append-only checkpoint writes (executeRawDirect + retry), fail-loud
consecutive-failure abort, first-file/10s flush cadence, race-safe pending-delta
under parallel workers, guaranteed final flush on every exit path incl. SIGTERM
(no-retry one-shot via registerCleanup), bankedFiles/reason observability,
event-loop yield to keep the lock heartbeat alive, and routing the bare
(no-source) sync through withRefreshingLock.
…point-1794

# Conflicts:
#	src/commands/sync.ts
#	src/core/migrate.ts
…ges under pool exhaustion + repeated kills (#1794)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 959af10 into master Jun 8, 2026
19 of 21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 9, 2026
* upstream/master:
  v0.42.37.0 fix(security,ingest): source-isolation grant enforcement + non-string frontmatter guard + papercuts (garrytan#1999)
  v0.42.36.0 fix(sync): resumable, durable, single-flight sync — converges under pool exhaustion + repeated kills (garrytan#1794) (garrytan#1980)
  v0.42.35.0 fix(sync): recover from unreachable last_commit instead of full-walking forever (garrytan#1970) (garrytan#1975)
  v0.42.34.0 feat(search): typed-edge relational retrieval — relationship questions get relationship answers (garrytan#1959)
  docs(designs): add COMMUNITY_IDEAS ledger from open-PR backlog triage (garrytan#1969)
  v0.42.33.0 fix(sources): confine sync re-clone to gbrain-owned clones; never delete a user working tree (garrytan#1881) (garrytan#1960)

# Conflicts:
#	src/core/operations.ts

Development

Successfully merging this pull request may close these issues.

Large-source full sync never converges: killed mid-import loses all progress (last_commit only advances on completion) + backlog outruns each attempt
