
v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (#1794) #1808

Merged
garrytan merged 7 commits into master from garrytan/sync-resumable-checkpoint on Jun 3, 2026
@garrytan garrytan commented Jun 3, 2026

What & why (#1794)

A large gbrain sync killed partway through used to throw away all its progress: last_commit advanced only on full completion, so the next run re-walked a growing diff and never caught up. The reporter's case: a background enrich process added ~40K pages overnight (one page per commit), the next sync faced a 44K-file backlog and was killed at ~16% by a session timeout, banking nothing, and an hour later the backlog was even bigger. Because every run restarted from zero against a backlog that only grew, the sync could never converge.
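The non-convergence can be made concrete with a toy model. This is a hedged sketch, not gbrain code: `SyncModel`, `runsToConverge`, and the growth figure of 30 files per run are illustrative assumptions, while the 44K backlog and the roughly 16% (about 7,000 files) per-run capacity come from the report above.

```typescript
// Hypothetical model: names and the growth rate are illustrative, not gbrain code.
interface SyncModel {
  backlog: number;        // files waiting at the first run
  perRunCapacity: number; // files imported before the run is killed
  growthPerRun: number;   // new files landing between runs (assumed value)
  resumable: boolean;     // whether a killed run banks its completed files
}

// Returns the 1-based run on which the backlog first reaches zero,
// or null if that never happens within maxRuns.
function runsToConverge(model: SyncModel, maxRuns: number): number | null {
  let backlog = model.backlog;
  for (let run = 1; run <= maxRuns; run++) {
    if (model.resumable) {
      // Banked progress survives the kill, so the backlog shrinks every run.
      backlog -= Math.min(model.perRunCapacity, backlog);
      if (backlog === 0) return run;
    } else if (model.perRunCapacity >= backlog) {
      // All-or-nothing: progress only counts if the whole backlog fits in one run.
      return run;
    }
    backlog += model.growthPerRun;
  }
  return null;
}

const reported = { backlog: 44000, perRunCapacity: 7000, growthPerRun: 30 };
const oldBehavior = runsToConverge({ ...reported, resumable: false }, 1000);
const newBehavior = runsToConverge({ ...reported, resumable: true }, 1000);
console.log({ oldBehavior, newBehavior });
```

Under these assumed inputs the all-or-nothing model never finishes, while the resumable one drains the backlog in a handful of runs.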

This makes incremental sync resumable and fixes the two co-conspirators the review surfaced.

How

Pinned-target path-set checkpoint (commit-watermark was rejected in eng + Codex review for non-linear-history bugs):

  • performSyncInner drains a fixed lastCommit..pin range. pin is a pinned target commit held in two op_checkpoints rows (sync = completed file paths, sync-target = [pin]), keyed by syncFingerprint({sourceId, lastCommit}) — keyed on the anchor, never HEAD, so the checkpoint survives a backlog growing underneath it.
  • The import loop resume-filters deletes/renames/adds against the banked completedPaths and flushes every GBRAIN_SYNC_CHECKPOINT_EVERY (default 1000).
  • last_commit + last_sync_at advance to pin (via commitTimeMs) only at full import completion, then both checkpoint rows clear. A killed/aborted/blocked run banks completed and leaves the anchor untouched — so the source correctly stays "stale" and reschedules.
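The resume-filter and periodic flush described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `fingerprintSketch` and `drainWithCheckpoint` are hypothetical names, the string-join key is a stand-in for whatever syncFingerprint actually computes, and the loop is synchronous for brevity.

```typescript
// Hypothetical stand-in for syncFingerprint: keyed on the anchor, never on HEAD,
// so the key stays stable while new commits pile up behind the pin.
function fingerprintSketch(key: { sourceId: string; lastCommit: string }): string {
  return `${key.sourceId}:${key.lastCommit}`;
}

// Imports every path not already banked, flushing the completed set every
// `flushEvery` imports (the PR's default for GBRAIN_SYNC_CHECKPOINT_EVERY is 1000).
// Returns how many paths this run imported.
function drainWithCheckpoint(
  changedPaths: string[],
  completedPaths: Set<string>,
  importOne: (path: string) => void,
  flush: (completed: string[]) => void,
  flushEvery = 1000,
): number {
  let sinceFlush = 0;
  let imported = 0;
  for (const path of changedPaths) {
    if (completedPaths.has(path)) continue; // resume filter: banked last run
    importOne(path); // a kill or throw here leaves earlier flushes intact
    completedPaths.add(path);
    imported++;
    sinceFlush++;
    if (sinceFlush >= flushEvery) {
      flush([...completedPaths]);
      sinceFlush = 0;
    }
  }
  if (sinceFlush > 0) flush([...completedPaths]); // bank the tail
  return imported;
}
```

A run that dies between flushes loses at most `flushEvery - 1` imports, which is the trade-off the flush interval controls. In this sketch, clearing the checkpoint rows and advancing the anchor would happen in the caller, only after the loop returns normally.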

Pinned target eliminates the staleness window (Codex #3): completion advances to pin, not live HEAD, so commits landing after the pin are a clean next-sync pin..HEAD diff and never get skipped. A file added in-range but deleted from disk after the pin is a SKIP, not a failure. A history rewrite (pin no longer an ancestor of HEAD) discards the checkpoint and re-pins.

Forward-progress head gate: the pre-existing strict HEAD == captured head-drift gate — which blocked the run on any concurrent commit (exactly the "enrich commits every ~2 min" case) — is replaced by git merge-base --is-ancestor pin HEAD. Forward progress is safe; only a real rewrite blocks.
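A rough sketch of how such a gate could interpret the git exit status. `checkPin`, `GateDecision`, and the injected `runGit` callback are hypothetical names for illustration; the only thing taken from git itself is that `git merge-base --is-ancestor A B` exits 0 when A is an ancestor of B and 1 when it is not.

```typescript
type GateDecision = "proceed" | "rewrite-detected";

// runGit is injected so the decision logic can be exercised without a repository.
// It is expected to run `git <args>` and return the process exit status.
function checkPin(pin: string, runGit: (args: string[]) => number): GateDecision {
  const status = runGit(["merge-base", "--is-ancestor", pin, "HEAD"]);
  if (status === 0) return "proceed";          // HEAD only moved forward past the pin
  if (status === 1) return "rewrite-detected"; // pin is no longer in HEAD's history
  // Any other status is a git failure (for example an unknown commit), not a verdict.
  throw new Error(`git merge-base exited with status ${status}`);
}

// For contrast, the strict gate this replaces would compare hashes directly,
// so any concurrent commit made it fail.
function strictGate(capturedHead: string, currentHead: string): boolean {
  return capturedHead === currentHead;
}
```

With a background process committing every couple of minutes, `strictGate` fails on almost every long run, whereas `checkPin` only reports a problem when history was actually rewritten.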

Downstream decoupled (convergence == import convergence): large syncs defer link/timeline extraction to the resumable gbrain extract --stale watermark + the autopilot cycle (and embedding to embed --stale/backfill) instead of running a 44K-page pass inline. Small syncs are unchanged — still inline extract/facts/embed.
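The split between small and large syncs might look something like the sketch below. The threshold value and the names `LARGE_SYNC_FILE_THRESHOLD` and `planDownstream` are assumptions, since the PR does not say where "large" begins. Only the two phases the PR explicitly describes as deferred (extract and embed) are modeled.

```typescript
// Hypothetical threshold: the PR does not state the real cutoff.
const LARGE_SYNC_FILE_THRESHOLD = 1000;

interface DownstreamPlan {
  inline: string[];   // phases run inside this sync
  deferred: string[]; // phases left to the later resumable --stale sweeps
}

function planDownstream(
  changedFiles: number,
  threshold: number = LARGE_SYNC_FILE_THRESHOLD,
): DownstreamPlan {
  const phases = ["extract", "embed"];
  if (changedFiles <= threshold) {
    // Small sync: unchanged behavior, everything runs inline.
    return { inline: phases, deferred: [] };
  }
  // Large sync: the import is the only thing that has to converge here.
  return { inline: [], deferred: phases };
}
```

The point of the design is that a 44K-page import no longer has to survive a 44K-page extraction pass in the same process before it can bank its result.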

Connection-budget clamp (GBRAIN_MAX_CONNECTIONS, opt-in): caps a single sync's worker fan-out so it can't exhaust a low-cap pooler (Supabase's 20-client default) and starve its own retries. gbrain doctor's new pool_budget check nudges toward GBRAIN_POOL_SIZE=2 when the math doesn't fit.
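One plausible shape for the budget arithmetic, shown as a hedged sketch. The names `parseMaxConnections`, `clampWorkersSketch`, and `poolBudgetFits` are hypothetical, and the model (one parent pool plus one pool per worker) is an assumption about how connections are counted, not the actual formula in sync-concurrency.ts.

```typescript
interface BudgetInput {
  requestedWorkers: number;
  maxConnections?: number; // undefined means the clamp is not opted in
  parentPoolSize: number;  // connections held by the parent process
  workerPoolSize: number;  // connections each worker may open
}

// Reads an env-style value; anything that is not a positive integer disables the clamp.
function parseMaxConnections(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const parsed = Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
}

// Caps the worker count so parent + workers stay within the budget.
// Never drops below one worker, so the sync can still run serially.
function clampWorkersSketch(input: BudgetInput): number {
  const { requestedWorkers, maxConnections, parentPoolSize, workerPoolSize } = input;
  if (maxConnections === undefined) return requestedWorkers;
  const room = maxConnections - parentPoolSize;
  const affordable = Math.floor(room / workerPoolSize);
  return Math.max(1, Math.min(requestedWorkers, affordable));
}

// Doctor-style check: is there room for at least one worker next to the parent pool?
function poolBudgetFits(input: BudgetInput): boolean {
  if (input.maxConnections === undefined) return true;
  return input.parentPoolSize + input.workerPoolSize <= input.maxConnections;
}
```

For example, with a budget of 20, a parent pool of 10, and 2 connections per worker, a request for 8 workers would be clamped to 5 under this model.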

Single-flight: performSync throws a typed SyncLockBusyError; the Minion sync handler marks the job skipped (not failed) on a held lock, so a cron/autopilot tick defers to the holder without polluting failed-job/crash metrics.
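The typed-error pattern can be illustrated with a small sketch. `SyncLockBusyErrorSketch`, `JobOutcome`, and `runSyncJob` are hypothetical stand-ins; the real SyncLockBusyError and the Minion handler in jobs.ts may carry different fields and statuses.

```typescript
// Stand-in for the typed error; the real class may differ.
class SyncLockBusyErrorSketch extends Error {
  readonly sourceId: string;
  constructor(sourceId: string) {
    super(`sync lock busy for source ${sourceId}`);
    this.name = "SyncLockBusyError";
    this.sourceId = sourceId;
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

type JobOutcome =
  | { status: "completed" }
  | { status: "skipped"; reason: string }
  | { status: "failed"; error: string };

// A held lock is an expected condition, so it maps to "skipped";
// every other error is still a genuine failure.
function runSyncJob(performSync: () => void): JobOutcome {
  try {
    performSync();
    return { status: "completed" };
  } catch (err) {
    if (err instanceof SyncLockBusyErrorSketch) {
      return { status: "skipped", reason: err.message };
    }
    return { status: "failed", error: err instanceof Error ? err.message : String(err) };
  }
}
```

Matching on the error type rather than on message text is what lets the handler tell "someone else is already syncing" apart from a crash without string parsing.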

Commits (bisectable)

  1. feat(sync): checkpoint primitives (syncFingerprint, commitTimeMs, connection-budget clamp)
  2. feat(sync): resumable performSyncInner (pinned target) + SyncLockBusyError skip in the Minion handler
  3. feat(doctor): pool_budget check + ops-category registration
  4. test(sync): 13-case resumable-sync regression suite + vanished-file contract update
  5. chore: version bump + CHANGELOG + CLAUDE.md + llms

Tests

  • New test/sync-resumable-import.serial.test.ts (13): convergence regression, resume-skips-checkpointed, pinned-target/forward-drift, history-rewrite re-pin, last_sync_at-not-bumped-on-block + good-file banking, vanished-file skip, dry-run/empty-diff, + pure fingerprint/clamp/pool-budget helpers.
  • test/sync-parallel.test.ts vanished-mid-sync test updated to the new skip contract (supersedes the v0.22.13 CODEX-3 failedFiles behavior).
  • Full unit suite: 12,923 pass; the only 3 failures were from this change (vanished contract, doctor-category registry, a torn package.json read during a mid-run bump) — all fixed and re-verified.
  • bun run verify (typecheck + 29 guards) green.

Reviewed

Eng review (plan-stage) + Codex outside-voice both ran in plan mode — Codex reversed the mechanism from commit-watermark to the pinned-target path-set checkpoint (caught the non-linear-history bug, the scheduler-staleness lie, and the facts-stranding gap). Plan: ~/.claude/plans/system-instruction-you-are-working-modular-kahan.md.

🤖 Generated with Claude Code

garrytan and others added 7 commits June 2, 2026 23:33
- op-checkpoint.ts: syncFingerprint({sourceId, lastCommit}) keyed on the
  anchor (never HEAD) so the checkpoint survives a growing backlog.
- source-health.ts: commitTimeMs(localPath, sha) for stamping
  newest_content_at against a pinned (non-HEAD) commit.
- sync-concurrency.ts: resolveMaxConnections + clampWorkersForConnectionBudget
  for the opt-in GBRAIN_MAX_CONNECTIONS single-sync footprint clamp.
…1794)

performSyncInner now drains a fixed lastCommit..pin range, banking completed
file paths to op_checkpoints and advancing last_commit (+ last_sync_at) ONLY
at full import completion. A killed/aborted/blocked run leaves the anchor
untouched and resumes from the banked set next run — the convergence fix.

- Pinned target: completion advances to the pin, not live HEAD, so commits
  landing after the pin are a clean next-sync diff (kills the staleness window).
  History rewrite (pin not an ancestor of HEAD) discards the checkpoint + re-pins.
- Forward-progress head gate: merge-base --is-ancestor pin HEAD replaces the
  strict "HEAD == captured" gate that blocked on every concurrent enrich commit.
- Vanished-on-disk added file -> skip + checkpoint, not a failedFiles block.
- Large syncs defer extract/embed to the resumable --stale sweeps (convergence
  == import convergence); small syncs keep inline extract/facts/embed.
- GBRAIN_MAX_CONNECTIONS clamp on the worker fan-out (opt-in).
- Typed SyncLockBusyError; the Minion sync handler (jobs.ts) marks the job
  SKIPPED (not failed) on a held lock so cron/autopilot defers cleanly.

computePoolBudgetCheck + checkPoolBudget warn when the parent pool leaves no
room for a parallel sync worker under GBRAIN_MAX_CONNECTIONS, pointing at
GBRAIN_POOL_SIZE=2. Registered in the ops category set.
…1794)

- sync-resumable-import.serial.test.ts (13 cases): convergence regression,
  resume-skips-checkpointed, pinned-target/forward-drift, history-rewrite
  re-pin, last_sync_at-not-bumped-on-block + good-file banking, vanished-file
  skip, dry-run/empty-diff, + pure fingerprint/clamp/pool-budget helpers.
- sync-parallel.test.ts: vanished-mid-sync added file now asserts the new
  skip contract (supersedes the v0.22.13 CODEX-3 failedFiles behavior).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…able-checkpoint

# Conflicts:
#	CHANGELOG.md
#	CLAUDE.md
#	VERSION
#	llms-full.txt
#	package.json
…able-checkpoint

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit fd2fde9 into master Jun 3, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)