v0.42.17.0 fix(sync): resumable incremental sync, killed mid-import no longer loses progress (#1794) #1808
Merged
- op-checkpoint.ts: syncFingerprint({sourceId, lastCommit}) keyed on the
anchor (never HEAD) so the checkpoint survives a growing backlog.
- source-health.ts: commitTimeMs(localPath, sha) for stamping
newest_content_at against a pinned (non-HEAD) commit.
- sync-concurrency.ts: resolveMaxConnections + clampWorkersForConnectionBudget
for the opt-in GBRAIN_MAX_CONNECTIONS single-sync footprint clamp.
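One way to picture the anchor-keyed fingerprint is the sketch below. It is not the code in op-checkpoint.ts: the input shape and the key format are assumptions made for illustration, and the real helper may well hash its inputs. The only property it demonstrates is that HEAD is not an input, so the key stays stable while new commits land on top of the anchor.

```typescript
// Hypothetical input shape; the real op-checkpoint.ts may differ.
interface SyncFingerprintInput {
  sourceId: string;
  lastCommit: string | null; // the anchor: last fully synced commit
}

// Assumed key format (illustrative only). HEAD is deliberately not an input.
function syncFingerprintSketch({ sourceId, lastCommit }: SyncFingerprintInput): string {
  return `sync:${JSON.stringify([sourceId, lastCommit])}`;
}

// Two runs against the same anchor while HEAD moves from h1 to h2.
const firstRun = {
  head: "h1",
  key: syncFingerprintSketch({ sourceId: "notes", lastCommit: "a1b2c3" }),
};
const secondRun = {
  head: "h2",
  key: syncFingerprintSketch({ sourceId: "notes", lastCommit: "a1b2c3" }),
};

// Once the anchor advances, the key changes and the old checkpoint stops matching.
const afterAdvance = syncFingerprintSketch({ sourceId: "notes", lastCommit: "d4e5f6" });

console.log(firstRun.key === secondRun.key); // true: a growing backlog does not move the key
console.log(firstRun.key === afterAdvance); // false: a completed sync starts a fresh checkpoint
```

Keying on the anchor rather than HEAD is what lets a resumed run find the rows banked by the run that was killed.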
…1794) performSyncInner now drains a fixed lastCommit..pin range, banking completed file paths to op_checkpoints and advancing last_commit (+ last_sync_at) ONLY at full import completion. A killed/aborted/blocked run leaves the anchor untouched and resumes from the banked set next run: the convergence fix.
- Pinned target: completion advances to the pin, not live HEAD, so commits landing after the pin are a clean next-sync diff (kills the staleness window). History rewrite (pin not an ancestor of HEAD) discards the checkpoint + re-pins.
- Forward-progress head gate: merge-base --is-ancestor pin HEAD replaces the strict "HEAD == captured" gate that blocked on every concurrent enrich commit.
- Vanished-on-disk added file -> skip + checkpoint, not a failedFiles block.
- Large syncs defer extract/embed to the resumable --stale sweeps (convergence == import convergence); small syncs keep inline extract/facts/embed.
- GBRAIN_MAX_CONNECTIONS clamp on the worker fan-out (opt-in).
- Typed SyncLockBusyError; the Minion sync handler (jobs.ts) marks the job SKIPPED (not failed) on a held lock so cron/autopilot defers cleanly.
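The drain-and-bank behavior described above can be modeled in memory as follows. This is a minimal sketch under stated assumptions, not performSyncInner itself: CheckpointStore, Source, and drainRangeSketch are hypothetical names, the two store fields only stand in for the sync and sync-target rows, and an interruption is simulated by importFile throwing.

```typescript
// Hypothetical stand-ins for the two op_checkpoints rows.
interface CheckpointStore {
  completed: Set<string>; // plays the role of the `sync` row (banked file paths)
  pin: string | null;     // plays the role of the `sync-target` row
}

interface Source {
  lastCommit: string; // the anchor
}

function drainRangeSketch(
  source: Source,
  store: CheckpointStore,
  headAtStart: string,
  filesInRange: string[],
  importFile: (path: string) => void,
  flushEvery = 1000,
): "complete" | "interrupted" {
  // Reuse a banked pin on resume; otherwise pin the target now.
  const pin = store.pin ?? headAtStart;
  store.pin = pin;

  let unflushed: string[] = [];
  const flush = (): void => {
    for (const path of unflushed) store.completed.add(path);
    unflushed = [];
  };

  try {
    for (const path of filesInRange) {
      if (store.completed.has(path)) continue; // resume: skip banked files
      importFile(path);
      unflushed.push(path);
      if (unflushed.length >= flushEvery) flush();
    }
  } catch {
    flush(); // graceful abort: bank finished files, leave the anchor untouched
    return "interrupted";
  }

  flush();
  source.lastCommit = pin; // advance to the pin, never to live HEAD
  store.completed.clear(); // full completion clears both checkpoint rows
  store.pin = null;
  return "complete";
}
```

A hard kill such as SIGKILL would not run the catch branch, so in that case only batches that had already been flushed would survive; a smaller flush interval narrows that loss at the cost of more writes.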
computePoolBudgetCheck + checkPoolBudget warn when the parent pool leaves no room for a parallel sync worker under GBRAIN_MAX_CONNECTIONS, pointing at GBRAIN_POOL_SIZE=2. Registered in the ops category set.
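A rough sketch of how the clamp and the budget warning could fit together is shown below. The real resolveMaxConnections, clampWorkersForConnectionBudget, and computePoolBudgetCheck are not shown in this PR text, so every function name, signature, and formula here is an assumption. In particular, the sketch assumes each parallel worker holds one connection and that the parent pool size is subtracted from the cap.

```typescript
type Env = Record<string, string | undefined>;

// Opt-in: an unset or invalid GBRAIN_MAX_CONNECTIONS means "no clamp" (assumed behavior).
function resolveMaxConnectionsSketch(env: Env): number | null {
  const raw = env["GBRAIN_MAX_CONNECTIONS"];
  if (raw === undefined || raw.trim() === "") return null;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : null;
}

// Assumed model: room for workers = cap - parent pool, one connection per worker.
function clampWorkersSketch(
  requested: number,
  maxConnections: number | null,
  parentPoolSize: number,
): number {
  if (maxConnections === null) return requested;
  const room = maxConnections - parentPoolSize;
  return Math.max(1, Math.min(requested, room)); // never below 1 (serial)
}

interface PoolBudgetResult {
  ok: boolean;
  message: string;
}

// Warn when the parent pool leaves no room for even one parallel worker.
function poolBudgetSketch(maxConnections: number | null, parentPoolSize: number): PoolBudgetResult {
  if (maxConnections === null) {
    return { ok: true, message: "GBRAIN_MAX_CONNECTIONS unset; no clamp applied" };
  }
  const room = maxConnections - parentPoolSize;
  if (room >= 1) {
    return { ok: true, message: `room for ${room} parallel sync worker(s)` };
  }
  return {
    ok: false,
    message:
      `parent pool (${parentPoolSize}) leaves no room for a parallel sync worker ` +
      `under GBRAIN_MAX_CONNECTIONS=${maxConnections}; consider GBRAIN_POOL_SIZE=2`,
  };
}

console.log(clampWorkersSketch(8, resolveMaxConnectionsSketch({ GBRAIN_MAX_CONNECTIONS: "6" }), 2)); // 4
console.log(poolBudgetSketch(4, 10).ok); // false
```

Under this model a cap of 6 with a parent pool of 2 leaves room for four workers, while a parent pool of 10 under a cap of 4 trips the warning.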
…1794)
- sync-resumable-import.serial.test.ts (13 cases): convergence regression, resume-skips-checkpointed, pinned-target/forward-drift, history-rewrite re-pin, last_sync_at-not-bumped-on-block + good-file banking, vanished-file skip, dry-run/empty-diff, + pure fingerprint/clamp/pool-budget helpers.
- sync-parallel.test.ts: vanished-mid-sync added file now asserts the new skip contract (supersedes the v0.22.13 CODEX-3 failedFiles behavior).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…able-checkpoint # Conflicts: # CHANGELOG.md # CLAUDE.md # VERSION # llms-full.txt # package.json
…able-checkpoint # Conflicts: # CHANGELOG.md # VERSION # package.json
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on Jun 3, 2026
What & why (#1794)
A large gbrain sync killed partway through used to throw away all its progress: last_commit advanced only on full completion, so the next run re-walked a growing diff and never caught up. The reporter's case: a background enrich process added ~40K pages overnight (one page per commit), the next sync had a 44K-file backlog, it got killed at ~16% by a session timeout, banked nothing, and the next hour the backlog was bigger. Mathematically, it couldn't converge.

This makes incremental sync resumable and fixes the two co-conspirators the review surfaced.
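The non-convergence can be made concrete with a small simulation. The backlog (44K files) and the ~16% kill point come from the report above. The growth of 30 files per run is an assumption derived from "one commit every ~2 min" with hourly syncs, and runsToConvergeSketch is a hypothetical helper, not part of gbrain.

```typescript
// Returns the run number on which the backlog is fully drained, or null if it never is.
function runsToConvergeSketch(
  backlog: number,
  perRun: number,       // files imported before the session timeout kills the run
  growthPerRun: number, // new files landing between runs (assumed)
  resumable: boolean,
  maxRuns = 100,
): number | null {
  let remaining = backlog;
  for (let run = 1; run <= maxRuns; run++) {
    if (remaining <= perRun) return run; // this run finishes before the kill
    if (resumable) remaining -= perRun;  // progress is banked only when resumable
    remaining += growthPerRun;           // the enrich process keeps committing
  }
  return null;
}

const backlog = 44_000;
const perRun = Math.floor(backlog * 0.16); // 7040 files, the ~16% kill point
const growth = 30;                         // assumption: ~1 commit per 2 min, hourly syncs

console.log(runsToConvergeSketch(backlog, perRun, growth, false)); // null: never catches up
console.log(runsToConvergeSketch(backlog, perRun, growth, true));  // 7
```

Without banking, every run starts from a backlog at least as large as the last one. With banking, the same kill point drains the backlog in a handful of runs as long as each run imports more than arrives between runs.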
How
Pinned-target path-set checkpoint (commit-watermark was rejected in eng + Codex review for non-linear-history bugs):
- performSyncInner drains a fixed lastCommit..pin range. pin is a pinned target commit held in two op_checkpoints rows (sync = completed file paths, sync-target = [pin]), keyed by syncFingerprint({sourceId, lastCommit}): keyed on the anchor, never HEAD, so the checkpoint survives a backlog growing underneath it.
- Completed file paths are banked in completedPaths, which flushes every GBRAIN_SYNC_CHECKPOINT_EVERY (default 1000).
- last_commit + last_sync_at advance to pin (via commitTimeMs) only at full import completion, then both checkpoint rows clear. A killed/aborted/blocked run banks completed and leaves the anchor untouched, so the source correctly stays "stale" and reschedules.

Pinned target eliminates the staleness window (Codex #3): completion advances to pin, not live HEAD, so commits landing after the pin are a clean next-sync pin..HEAD diff and never get skipped. A file added in-range but deleted from disk after the pin is a SKIP, not a failure. A history rewrite (pin no longer an ancestor of HEAD) discards the checkpoint and re-pins.

Forward-progress head gate: the pre-existing strict HEAD == captured head-drift gate, which blocked the run on any concurrent commit (exactly the "enrich commits every ~2 min" case), is replaced by git merge-base --is-ancestor pin HEAD. Forward progress is safe; only a real rewrite blocks.

Downstream decoupled (convergence == import convergence): large syncs defer link/timeline extraction to the resumable gbrain extract --stale watermark + the autopilot cycle (and embedding to embed --stale/backfill) instead of running a 44K-page pass inline. Small syncs are unchanged: still inline extract/facts/embed.

Connection-budget clamp (GBRAIN_MAX_CONNECTIONS, opt-in): caps a single sync's worker fan-out so it can't exhaust a low-cap pooler (Supabase's 20-client default) and starve its own retries. gbrain doctor's new pool_budget check nudges toward GBRAIN_POOL_SIZE=2 when the math doesn't fit.

Single-flight: performSync throws a typed SyncLockBusyError; the Minion sync handler marks the job skipped (not failed) on a held lock, so a cron/autopilot tick defers to the holder without polluting failed-job/crash metrics.

Commits (bisectable)
- feat(sync): checkpoint primitives (syncFingerprint, commitTimeMs, connection-budget clamp)
- feat(sync): resumable performSyncInner (pinned target) + SyncLockBusyError skip in the Minion handler
- feat(doctor): pool_budget check + ops-category registration
- test(sync): 13-case resumable-sync regression suite + vanished-file contract update
- chore: version bump + CHANGELOG + CLAUDE.md + llms

Tests
- test/sync-resumable-import.serial.test.ts (13): convergence regression, resume-skips-checkpointed, pinned-target/forward-drift, history-rewrite re-pin, last_sync_at-not-bumped-on-block + good-file banking, vanished-file skip, dry-run/empty-diff, + pure fingerprint/clamp/pool-budget helpers.
- test/sync-parallel.test.ts vanished-mid-sync test updated to the new skip contract (supersedes the v0.22.13 CODEX-3 failedFiles behavior).
- … package.json read during a mid-run bump), all fixed and re-verified.
- bun run verify (typecheck + 29 guards) green.

Reviewed
Eng review (plan-stage) + Codex outside-voice both ran in plan mode. Codex reversed the mechanism from commit-watermark to the pinned-target path-set checkpoint (caught the non-linear-history bug, the scheduler-staleness lie, and the facts-stranding gap). Plan: ~/.claude/plans/system-instruction-you-are-working-modular-kahan.md.

🤖 Generated with Claude Code