Skip to content

RFC: Parallel Chunked Sync for Large Federated Brains#1472

Closed
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:feat/parallel-chunked-sync
Closed

RFC: Parallel Chunked Sync for Large Federated Brains#1472
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:feat/parallel-chunked-sync

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Problem

The sync pipeline becomes a reliability bottleneck at 370K+ pages with 4 federated sources. Three compounding failure modes cause sources to fall behind:

  1. Serial timeout cascade — monolithic sync --all times out, killing ALL sources including ones not yet reached
  2. Lock contention — stale locks from timed-out runs block the next cron wave
  3. Embed blocks sync — inline embed eats into the timeout budget

Production data (last 24h): 8 of 12 sync cron runs timed out at 1803s. Sources going 50+ hours stale.

Proposal (RFC only — no code changes)

Four additive flags:

  • --timeout <seconds> — graceful self-termination, commits partial progress
  • --break-lock-if-stale <seconds> — cron self-healing without manual --break-lock
  • --independent mode — per-source process isolation (one source timing out doesn't affect others)
  • Deprecate monolithic sync --all as the default cron path

Full RFC: docs/rfcs/parallel-chunked-sync.md

Impact

Metric Current After
Sync reliability ~33% (4/12 runs) ~95%+
Max source staleness 50+ hours < 2 hours
Recovery from timeout Manual intervention Automatic

All changes are additive — sync --all continues to work as-is.

Documents three compounding failure modes in the sync pipeline:
1. Serial timeout cascade (monolithic sync --all times out, kills all sources)
2. Lock contention between cron waves (stale locks from timed-out runs)
3. Embed phase blocks sync phase (inline embed eats timeout budget)

Proposes four additive changes:
- --timeout flag for graceful self-termination
- --break-lock-if-stale for cron self-healing
- --independent mode for per-source process isolation
- Deprecate monolithic sync --all as default cron path

Production data: 8/12 sync runs timed out in last 24h, sources going 50+ hours stale.
@garrytan

Copy link
Copy Markdown
Owner

Closing — the actual fix shipped in PR #1506 as v0.41.15.0.

Your RFC was correct in every diagnosis. The shipped plan reduced scope from your 4 surfaces to 2 surfaces + a documented shell-level cron pattern, after iteration with codex over 3 review passes on cascade-resilience and lock-stealing invariants. Specifically:

  • --independent (Minion fan-out) was rejected because MinionWorker is an in-process worker pool, not OS-process-per-source — the "process isolation" claim was oversold. waitForCompletion throws on timeout but doesn't cancel the underlying job (leaving a hot lock for the next cron). Shell timeout(1) per-source loop gives real OS process isolation with zero new gbrain code. Documented as the recommended cron pattern in CHANGELOG + README.
  • --break-lock-if-stale was replaced by extending the existing --break-lock with --all + --max-age. The semantic changed from "time since acquired" to "time since last refresh" via a new last_refreshed_at column (migration v98) so healthy long-running holders are safe by construction. Without that change, the cron pattern would have stolen healthy locks during the upgrade window.
  • --timeout ships as you proposed, with one honest scope statement: it covers pull + delete + rename + import only. Extract + embed run to completion after the bookmark write. Full-sync triggers (first sync, --full, chunker-bump rewalk) are deferred to v0.42+.

Credit for surfacing the production data lives in the v0.41.15.0 CHANGELOG entry.

Closing as superseded.

@garrytan garrytan closed this May 26, 2026
garrytan added a commit that referenced this pull request May 26, 2026
…1472 RFC) (#1506)

* feat(sync): migration v98 last_refreshed_at + deleteLockRowIfStale helper

Schema foundation for v0.41.15.0's `gbrain sync --break-lock --max-age <s>`
flag. Adds `gbrain_cycle_locks.last_refreshed_at TIMESTAMPTZ` as the
heartbeat signal that distinguishes wedged-but-alive lock holders from
healthy long-running syncs that are actively refreshing.

Why last_refreshed_at not acquired_at: `withRefreshingLock` already bumps
`ttl_expires_at` every ~5 min while work runs, but leaves `acquired_at` at
the original timestamp. A 35-min media-corpus sync that's healthy has
`acquired_at` 35 min ago but `last_refreshed_at` 30 seconds ago. Using
acquired_at for --max-age would steal healthy locks; last_refreshed_at
correctly identifies only holders whose JS interval has stopped firing.

D-V4-1 rollout safety: migration v98 backfills `last_refreshed_at = NOW()`
(NOT `= acquired_at`) so pre-upgrade holders running the old binary get a
30-min protection window. After that window all pre-upgrade syncs are
either complete (lock released) OR genuinely wedged (--max-age does the
right thing). Documented as a known caveat in CHANGELOG.

D-V4-mech-4 SQL cast: deleteLockRowIfStale uses `$N * INTERVAL '1 second'`
not `$N::interval` (Postgres does not cast integer to interval the latter
way). Atomic DELETE keyed on (id, holder_pid, last_refreshed_at < NOW() -
$N * INTERVAL '1 second') RETURNING id, last_refreshed_at — no TOCTOU
between inspect + delete.

D-V4-mech-3 schema-snapshot parity: column added to all 3 snapshots so
fresh init paths (pglite-schema.ts, schema.sql) initialize correctly
without depending on the migration runner. schema-embedded.ts regenerated
via `bun run build:schema`.

Pinned by 13 PGLite cases in test/sync-break-lock-all.test.ts:
tryAcquireDbLock writes on INSERT, withRefreshingLock refresh bumps both
columns, inspectLock surfaces the new field, deleteLockRowIfStale refuses
fresh / breaks stale / safe on holder_pid mismatch / refuses NULL
(pre-v98). R1 + R6 regression invariants from the v4 plan.

Closes #1472 (RFC from @garrytan-agents) — schema foundation only;
performSync abort threading + CLI flags + consumer threading land in
follow-up commits in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sync): --timeout + --max-age + partial status + per-source AbortController

The CLI surface for v0.41.15.0. Wires `gbrain sync --timeout <s>` (graceful
self-termination) and `gbrain sync --break-lock --all --max-age <s>`
(cron-self-heal) end-to-end through `performSync`, `runOne`, `runBreakLock`,
and all `SyncResult.status` consumers.

Surface 1: `gbrain sync --timeout <s>`
  - New `SyncOpts.signal?: AbortSignal` threads through `performSync` →
    `withRefreshingLock` work callback → `performSyncInner`.
  - D-V3-1 honest scope: abort checks fire ONLY in pre-bookmark phases
    (pull, delete, rename, import). Extract + embed run to completion if
    reached. The `last_commit` bookmark write at sync.ts:1261 is the
    invariant boundary — partial CANNOT advance the bookmark because the
    abort checkpoints sit strictly before that write.
  - D-V3-2 per-iteration: abort check at top of every loop iteration
    (delete, rename, serial import, each parallel worker's while loop)
    matches the per-file granularity the existing loops already have.
  - D-V3-3 per-source AbortController: `--timeout --all` creates ONE
    controller inside runOne per source so each gets its own budget;
    NOT a shared global controller (which would starve later sources).
    try/finally + timer.unref() guarantees cleanup on throw.
  - D-V4-mech-7 pull error.cause: pullRepo wraps execFileSync errors in
    GitOperationError. The catch inspects e.cause.code === 'ETIMEDOUT'
    and e.cause.signal === 'SIGTERM' (NOT the top-level error) to
    distinguish timeout (partial reason='pull_timeout') from ordinary
    pull failure (existing warn-and-continue, R2 invariant preserved).

Surface 2: `gbrain sync --break-lock [--all] [--max-age <s>]`
  - Drops the --all refusal at sync.ts:1610. When combined with --all,
    runBreakLock iterates every active source and prints per-source verdict.
  - --max-age routes through the new deleteLockRowIfStale helper from
    db-lock.ts (atomic age-gated DELETE; no TOCTOU). Healthy refreshing
    holders survive by construction; only wedged-but-alive holders trip.

D-V3-5 partial-status consumer threading (conservative posture matching
blocked_by_failures):
  - printSyncResult: new `case 'partial':` arm reports filesImported +
    reason; tells operator to re-run to continue.
  - manageGitignore (both single-source and parallel runOne sites,
    plus watch mode): excludes partial from the gate. A partial sync's
    db_only path set isn't fully reconciled.
  - Auto-embed-backfill enqueue inside runOne: excludes partial. The
    next clean sync will re-walk and re-decide.

CLI flag parsing (T16):
  - parseDurationSeconds in sync-concurrency.ts: accepts 60s/10m/1h/bare
    int; rejects 0/negatives/decimals/garbage. Names the failing flag in
    the error message.
  - --timeout requires --source OR --all (validation rejects bare
    `gbrain sync --timeout`).
  - --max-age requires --break-lock; mutually exclusive with
    --force-break-lock.

Coverage:
  - 15 unit cases (test/sync-timeout.test.ts) pin parseDurationSeconds +
    SyncResult union additivity.
  - 2 E2E cases (test/e2e/sync-parallel.test.ts) pin the abort-mid-import
    contract against real Postgres: status='partial', last_commit
    unchanged, filesImported bounded.

Closes #1472 (RFC from @garrytan-agents) — CLI surface; schema foundation
landed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(heavy): sync_timeout_rescue.sh reproducer for the cron-cascade

10K-page seed × 4 sources × deliberately tight --timeout × 3 sequential
cron emulations. Asserts every source reaches `last_commit === HEAD`
within 3 waves. Proves the v0.41.15.0 fix breaks the cascade the
PR #1472 RFC documented.

Workload (tests/heavy/_sync_timeout_rescue_workload.ts) is PGLite-only
because the PGLite engine forces serial sync internally (parallelEligible
excludes it). The parallel-fan-out + per-source AbortController case
lives in test/e2e/sync-parallel.test.ts against real Postgres. This
heavy test pins the contract that matters for cron: aborts → partial
returns → next wave content_hash-short-circuits + makes new progress.

Smoke-tested locally at PAGES=50 WAVES=2 TIMEOUT_SECONDS=2: every
source converges within 2 waves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v0.41.15.0): CHANGELOG + README + TODOS + version bump

Bumps VERSION + package.json to 0.41.15.0 (next slot after master's
v0.41.14.0). CHANGELOG entry leads ELI10 per gstack voice rules and
documents the 3 intentional honest gaps:
  1. --timeout covers pull + delete + rename + import only; extract +
     embed run to completion (D-V3-1 honest scope).
  2. First 30 min after migration v98, --max-age cannot identify wedged
     pre-upgrade holders (D-V4-1 rollout trade-off).
  3. Full-sync triggers (first sync, --full, chunker-version rewalk)
     don't respect --timeout yet (deferred to v0.42+).

README troubleshooting section: paste-ready cron pattern with shell
timeout(1) for OS-level process isolation + gbrain's --timeout for
graceful self-termination half-a-minute earlier.

TODOS.md: v0.42+ entries for subprocess fan-out (revisit if shell
timeout(1) proves insufficient), full-sync --timeout coverage via
AbortSignal in runImport, and runFactsBackstop microtask-queue
process-alive caveat.

llms-full.txt regenerated via `bun run build:llms`.

Closes #1472 (RFC from @garrytan-agents). Credit to @garrytan-agents
in the CHANGELOG for surfacing the production cron-failure data that
motivated the work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test-isolation): rewrite JSDoc to not match mock.module() lint regex

scripts/check-test-isolation.sh greps for the literal string `mock.module(`
to flag top-level module mocks (R2 rule — top-level mocks leak across files
in the shard process). The regex doesn't know about comments, so my two new
test files tripped the lint with JSDoc lines literally describing the rule:

  test/sync-timeout.test.ts:11   "* `mock.module()` (R2). Engine ..."
  test/sync-break-lock-all.test.ts:15  "* mock.module(), no process.env ..."

Both files had ZERO actual mock.module() calls — only the comment text
matched. Rewrote both JSDocs to refer to "top-level module mocks" instead
of the literal token. Same meaning; doesn't trip the regex.

`bun run check:test-isolation` now passes (714 non-serial unit files
scanned). `bun run verify` clean (22/22 checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants