RFC: Parallel Chunked Sync for Large Federated Brains#1472
Closed
garrytan-agents wants to merge 1 commit into
Closed
RFC: Parallel Chunked Sync for Large Federated Brains#1472garrytan-agents wants to merge 1 commit into
garrytan-agents wants to merge 1 commit into
Conversation
Documents three compounding failure modes in the sync pipeline: 1. Serial timeout cascade (monolithic sync --all times out, kills all sources) 2. Lock contention between cron waves (stale locks from timed-out runs) 3. Embed phase blocks sync phase (inline embed eats timeout budget) Proposes four additive changes: - --timeout flag for graceful self-termination - --break-lock-if-stale for cron self-healing - --independent mode for per-source process isolation - Deprecate monolithic sync --all as default cron path Production data: 8/12 sync runs timed out in last 24h, sources going 50+ hours stale.
10 tasks
Owner
|
Closing — the actual fix shipped in PR #1506 as v0.41.15.0. Your RFC was correct in every diagnosis. The shipped plan reduced scope from your 4 surfaces to 2 surfaces + a documented shell-level cron pattern, after iteration with codex over 3 review passes on cascade-resilience and lock-stealing invariants. Specifically:
Credit for surfacing the production data lives in the v0.41.15.0 CHANGELOG entry. Closing as superseded. |
garrytan
added a commit
that referenced
this pull request
May 26, 2026
…1472 RFC) (#1506) * feat(sync): migration v98 last_refreshed_at + deleteLockRowIfStale helper Schema foundation for v0.41.15.0's `gbrain sync --break-lock --max-age <s>` flag. Adds `gbrain_cycle_locks.last_refreshed_at TIMESTAMPTZ` as the heartbeat signal that distinguishes wedged-but-alive lock holders from healthy long-running syncs that are actively refreshing. Why last_refreshed_at not acquired_at: `withRefreshingLock` already bumps `ttl_expires_at` every ~5 min while work runs, but leaves `acquired_at` at the original timestamp. A 35-min media-corpus sync that's healthy has `acquired_at` 35 min ago but `last_refreshed_at` 30 seconds ago. Using acquired_at for --max-age would steal healthy locks; last_refreshed_at correctly identifies only holders whose JS interval has stopped firing. D-V4-1 rollout safety: migration v98 backfills `last_refreshed_at = NOW()` (NOT `= acquired_at`) so pre-upgrade holders running the old binary get a 30-min protection window. After that window all pre-upgrade syncs are either complete (lock released) OR genuinely wedged (--max-age does the right thing). Documented as a known caveat in CHANGELOG. D-V4-mech-4 SQL cast: deleteLockRowIfStale uses `$N * INTERVAL '1 second'` not `$N::interval` (Postgres does not cast integer to interval the latter way). Atomic DELETE keyed on (id, holder_pid, last_refreshed_at < NOW() - $N * INTERVAL '1 second') RETURNING id, last_refreshed_at — no TOCTOU between inspect + delete. D-V4-mech-3 schema-snapshot parity: column added to all 3 snapshots so fresh init paths (pglite-schema.ts, schema.sql) initialize correctly without depending on the migration runner. schema-embedded.ts regenerated via `bun run build:schema`. Pinned by 13 PGLite cases in test/sync-break-lock-all.test.ts: tryAcquireDbLock writes on INSERT, withRefreshingLock refresh bumps both columns, inspectLock surfaces the new field, deleteLockRowIfStale refuses fresh / breaks stale / safe on holder_pid mismatch / refuses NULL (pre-v98). R1 + R6 regression invariants from the v4 plan. Closes #1472 (RFC from @garrytan-agents) — schema foundation only; performSync abort threading + CLI flags + consumer threading land in follow-up commits in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(sync): --timeout + --max-age + partial status + per-source AbortController The CLI surface for v0.41.15.0. Wires `gbrain sync --timeout <s>` (graceful self-termination) and `gbrain sync --break-lock --all --max-age <s>` (cron-self-heal) end-to-end through `performSync`, `runOne`, `runBreakLock`, and all `SyncResult.status` consumers. Surface 1: `gbrain sync --timeout <s>` - New `SyncOpts.signal?: AbortSignal` threads through `performSync` → `withRefreshingLock` work callback → `performSyncInner`. - D-V3-1 honest scope: abort checks fire ONLY in pre-bookmark phases (pull, delete, rename, import). Extract + embed run to completion if reached. The `last_commit` bookmark write at sync.ts:1261 is the invariant boundary — partial CANNOT advance the bookmark because the abort checkpoints sit strictly before that write. - D-V3-2 per-iteration: abort check at top of every loop iteration (delete, rename, serial import, each parallel worker's while loop) matches the per-file granularity the existing loops already have. - D-V3-3 per-source AbortController: `--timeout --all` creates ONE controller inside runOne per source so each gets its own budget; NOT a shared global controller (which would starve later sources). try/finally + timer.unref() guarantees cleanup on throw. - D-V4-mech-7 pull error.cause: pullRepo wraps execFileSync errors in GitOperationError. The catch inspects e.cause.code === 'ETIMEDOUT' and e.cause.signal === 'SIGTERM' (NOT the top-level error) to distinguish timeout (partial reason='pull_timeout') from ordinary pull failure (existing warn-and-continue, R2 invariant preserved). Surface 2: `gbrain sync --break-lock [--all] [--max-age <s>]` - Drops the --all refusal at sync.ts:1610. When combined with --all, runBreakLock iterates every active source and prints per-source verdict. - --max-age routes through the new deleteLockRowIfStale helper from db-lock.ts (atomic age-gated DELETE; no TOCTOU). Healthy refreshing holders survive by construction; only wedged-but-alive holders trip. D-V3-5 partial-status consumer threading (conservative posture matching blocked_by_failures): - printSyncResult: new `case 'partial':` arm reports filesImported + reason; tells operator to re-run to continue. - manageGitignore (both single-source and parallel runOne sites, plus watch mode): excludes partial from the gate. A partial sync's db_only path set isn't fully reconciled. - Auto-embed-backfill enqueue inside runOne: excludes partial. The next clean sync will re-walk and re-decide. CLI flag parsing (T16): - parseDurationSeconds in sync-concurrency.ts: accepts 60s/10m/1h/bare int; rejects 0/negatives/decimals/garbage. Names the failing flag in the error message. - --timeout requires --source OR --all (validation rejects bare `gbrain sync --timeout`). - --max-age requires --break-lock; mutually exclusive with --force-break-lock. Coverage: - 15 unit cases (test/sync-timeout.test.ts) pin parseDurationSeconds + SyncResult union additivity. - 2 E2E cases (test/e2e/sync-parallel.test.ts) pin the abort-mid-import contract against real Postgres: status='partial', last_commit unchanged, filesImported bounded. Closes #1472 (RFC from @garrytan-agents) — CLI surface; schema foundation landed in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(heavy): sync_timeout_rescue.sh reproducer for the cron-cascade 10K-page seed × 4 sources × deliberately tight --timeout × 3 sequential cron emulations. Asserts every source reaches `last_commit === HEAD` within 3 waves. Proves the v0.41.15.0 fix breaks the cascade the PR #1472 RFC documented. Workload (tests/heavy/_sync_timeout_rescue_workload.ts) is PGLite-only because the PGLite engine forces serial sync internally (parallelEligible excludes it). The parallel-fan-out + per-source AbortController case lives in test/e2e/sync-parallel.test.ts against real Postgres. This heavy test pins the contract that matters for cron: aborts → partial returns → next wave content_hash-short-circuits + makes new progress. Smoke-tested locally at PAGES=50 WAVES=2 TIMEOUT_SECONDS=2: every source converges within 2 waves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(v0.41.15.0): CHANGELOG + README + TODOS + version bump Bumps VERSION + package.json to 0.41.15.0 (next slot after master's v0.41.14.0). CHANGELOG entry leads ELI10 per gstack voice rules and documents the 3 intentional honest gaps: 1. --timeout covers pull + delete + rename + import only; extract + embed run to completion (D-V3-1 honest scope). 2. First 30 min after migration v98, --max-age cannot identify wedged pre-upgrade holders (D-V4-1 rollout trade-off). 3. Full-sync triggers (first sync, --full, chunker-version rewalk) don't respect --timeout yet (deferred to v0.42+). README troubleshooting section: paste-ready cron pattern with shell timeout(1) for OS-level process isolation + gbrain's --timeout for graceful self-termination half-a-minute earlier. TODOS.md: v0.42+ entries for subprocess fan-out (revisit if shell timeout(1) proves insufficient), full-sync --timeout coverage via AbortSignal in runImport, and runFactsBackstop microtask-queue process-alive caveat. llms-full.txt regenerated via `bun run build:llms`. Closes #1472 (RFC from @garrytan-agents). Credit to @garrytan-agents in the CHANGELOG for surfacing the production cron-failure data that motivated the work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test-isolation): rewrite JSDoc to not match mock.module() lint regex scripts/check-test-isolation.sh greps for the literal string `mock.module(` to flag top-level module mocks (R2 rule — top-level mocks leak across files in the shard process). The regex doesn't know about comments, so my two new test files tripped the lint with JSDoc lines literally describing the rule: test/sync-timeout.test.ts:11 "* `mock.module()` (R2). Engine ..." test/sync-break-lock-all.test.ts:15 "* mock.module(), no process.env ..." Both files had ZERO actual mock.module() calls — only the comment text matched. Rewrote both JSDocs to refer to "top-level module mocks" instead of the literal token. Same meaning; doesn't trip the regex. `bun run check:test-isolation` now passes (714 non-serial unit files scanned). `bun run verify` clean (22/22 checks pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
May 28, 2026
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The sync pipeline becomes a reliability bottleneck at 370K+ pages with 4 federated sources. Three compounding failure modes cause sources to fall behind:
sync --alltimes out, killing ALL sources including ones not yet reachedProduction data (last 24h): 8 of 12 sync cron runs timed out at 1803s. Sources going 50+ hours stale.
Proposal (RFC only — no code changes)
Four additive flags:
--timeout <seconds>— graceful self-termination, commits partial progress--break-lock-if-stale <seconds>— cron self-healing without manual--break-lock--independentmode — per-source process isolation (one source timing out doesn't affect others)sync --allas the default cron pathFull RFC: docs/rfcs/parallel-chunked-sync.md
Impact
All changes are additive —
sync --allcontinues to work as-is.