v0.42.36.0 fix(sync): resumable, durable, single-flight sync — converges under pool exhaustion + repeated kills (#1794) #1980
Merged
Durable append-only checkpoint writes (executeRawDirect + retry), fail-loud consecutive-failure abort, first-file/10s flush cadence, race-safe pending-delta under parallel workers, guaranteed final flush on every exit path incl. SIGTERM (no-retry one-shot via registerCleanup), bankedFiles/reason observability, event-loop yield to keep the lock heartbeat alive, and routing the bare (no-source) sync through withRefreshingLock.
…point-1794 # Conflicts: # src/commands/sync.ts # src/core/migrate.ts
…ges under pool exhaustion + repeated kills (#1794) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le-checkpoint current state (#1794)
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on Jun 9, 2026:

* upstream/master:
  v0.42.37.0 fix(security,ingest): source-isolation grant enforcement + non-string frontmatter guard + papercuts (garrytan#1999)
  v0.42.36.0 fix(sync): resumable, durable, single-flight sync — converges under pool exhaustion + repeated kills (garrytan#1794) (garrytan#1980)
  v0.42.35.0 fix(sync): recover from unreachable last_commit instead of full-walking forever (garrytan#1970) (garrytan#1975)
  v0.42.34.0 feat(search): typed-edge relational retrieval — relationship questions get relationship answers (garrytan#1959)
  docs(designs): add COMMUNITY_IDEAS ledger from open-PR backlog triage (garrytan#1969)
  v0.42.33.0 fix(sources): confine sync re-clone to gbrain-owned clones; never delete a user working tree (garrytan#1881) (garrytan#1960)

# Conflicts:
#   src/core/operations.ts
What & why
A large `gbrain sync` that overran its launching session's timeout (SIGTERM) lost 100% of its progress and re-imported the whole backlog every run, never converging and burning CPU for hours while the source went quietly stale (#1794). The resumable checkpoint shipped in v0.42.17.0 had holes in exactly that failure mode, and #1794 recurred live. This PR closes all of them: sync is now resumable, durable, and single-flight.

How
Layer 1 — durability
- Checkpoint writes go through `executeRawDirect` + `withRetry`, so they survive `EMAXCONNSESSION` / `too_many_connections` (now classified retryable). Previously the write swallowed pool-exhaustion errors and banked nothing.
- Guaranteed final flush on every exit path: … (`partial()` flush), external SIGTERM (one-shot no-retry flush via `registerCleanup`, ordered before lock release), and clean completion.
- Fail-loud consecutive-failure abort: … `checkpoint_unavailable` partial instead of importing unbankable work. Every partial/blocked exit logs how many files were banked.
- First-file and ~10s flush cadence; race-safe `pendingCheckpointPaths` delta under parallel workers (swap + re-merge on failure).

Layer 2 — append-only storage (`op_checkpoint_paths`, migration v115)
- … `unnest($3::text[])` write: O(delta), not the old O(N²) full-array rewrite.
- `recordCompleted` keeps replace semantics for the 9 non-sync consumers; sync uses the additive `appendCompleted`.
- `loadOpCheckpoint` unions legacy `completed_keys` + child rows (UNION ALL, JS-deduped).

Layer 3 — lock heartbeat + single-flight
- An event-loop yield (`setTimeout(0)`, not `setImmediate`: Bun starves the timers phase) keeps the lock-refresh `setInterval` heartbeat alive mid-import.
- Bare `gbrain sync` (no `--source`) now uses `withRefreshingLock`; a cron sync that collides with a running one is a skip, not a phase failure.

Test plan
- … `op-checkpoint.test.ts` (append delta, union read, cascade clear, purge), `db-lock-heartbeat-takeover.test.ts` (grace protection, stale/NULL steal, direct-refresh bump, yield mechanism), `retry-matcher.test.ts` (`EMAXCONNSESSION` / 53300), `sync-resumable-import.serial.test.ts` (still green on the append rewrite).
- … (`bun run test`); a monolithic `bun test` OOMs PGLite's WASM runtime.

Review
Plan went through `/plan-eng-review` + cross-model outside voice (Codex), which caught a critical bug pre-implementation: routing only the lock refresh through the direct pool was insufficient because the read-pool health probe still gated renewal (fixed).

Closes #1794.

🤖 Generated with Claude Code
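
For readers skimming the Layer 1 notes, the "swap + re-merge on failure" handling of the pending checkpoint delta can be pictured with a small TypeScript sketch. This is not the code from this PR: `PendingDelta`, `flushPending`, and `WriteDelta` are hypothetical names, and the sketch assumes that workers only ever add completed paths and that a single flusher drains them.

```typescript
// Illustrative sketch only; all names here are hypothetical, not gbrain's API.
type WriteDelta = (paths: string[]) => Promise<void>;

class PendingDelta {
  private pending = new Set<string>();

  // Workers call this as each file finishes importing.
  add(path: string): void {
    this.pending.add(path);
  }

  get size(): number {
    return this.pending.size;
  }

  // Swap: hand the current batch to the caller and start a fresh set, so
  // paths added while a write is in flight land in the new set.
  take(): string[] {
    const batch = [...this.pending];
    this.pending = new Set<string>();
    return batch;
  }

  // Re-merge: put a failed batch back alongside anything added since.
  restore(batch: string[]): void {
    for (const p of batch) this.pending.add(p);
  }
}

// Flush the delta; resolves to the number of paths banked (0 on failure).
async function flushPending(buf: PendingDelta, write: WriteDelta): Promise<number> {
  const batch = buf.take();
  if (batch.length === 0) return 0;
  try {
    await write(batch);
    return batch.length;
  } catch {
    // The write failed: nothing was banked, so keep the paths for the next flush.
    buf.restore(batch);
    return 0;
  }
}
```

Because the batch is swapped out before the write is awaited, a worker that finishes a file mid-flush cannot have its path dropped when the write succeeds, and a failed write loses nothing because the batch is merged back for the next attempt.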