v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes #1472 RFC)#1506
Merged
Conversation
…lper Schema foundation for v0.41.15.0's `gbrain sync --break-lock --max-age <s>` flag. Adds `gbrain_cycle_locks.last_refreshed_at TIMESTAMPTZ` as the heartbeat signal that distinguishes wedged-but-alive lock holders from healthy long-running syncs that are actively refreshing. Why last_refreshed_at not acquired_at: `withRefreshingLock` already bumps `ttl_expires_at` every ~5 min while work runs, but leaves `acquired_at` at the original timestamp. A 35-min media-corpus sync that's healthy has `acquired_at` 35 min ago but `last_refreshed_at` 30 seconds ago. Using acquired_at for --max-age would steal healthy locks; last_refreshed_at correctly identifies only holders whose JS interval has stopped firing. D-V4-1 rollout safety: migration v98 backfills `last_refreshed_at = NOW()` (NOT `= acquired_at`) so pre-upgrade holders running the old binary get a 30-min protection window. After that window all pre-upgrade syncs are either complete (lock released) OR genuinely wedged (--max-age does the right thing). Documented as a known caveat in CHANGELOG. D-V4-mech-4 SQL cast: deleteLockRowIfStale uses `$N * INTERVAL '1 second'` not `$N::interval` (Postgres does not cast integer to interval the latter way). Atomic DELETE keyed on (id, holder_pid, last_refreshed_at < NOW() - $N * INTERVAL '1 second') RETURNING id, last_refreshed_at — no TOCTOU between inspect + delete. D-V4-mech-3 schema-snapshot parity: column added to all 3 snapshots so fresh init paths (pglite-schema.ts, schema.sql) initialize correctly without depending on the migration runner. schema-embedded.ts regenerated via `bun run build:schema`. Pinned by 13 PGLite cases in test/sync-break-lock-all.test.ts: tryAcquireDbLock writes on INSERT, withRefreshingLock refresh bumps both columns, inspectLock surfaces the new field, deleteLockRowIfStale refuses fresh / breaks stale / safe on holder_pid mismatch / refuses NULL (pre-v98). R1 + R6 regression invariants from the v4 plan. Closes #1472 (RFC from @garrytan-agents) — schema foundation only; performSync abort threading + CLI flags + consumer threading land in follow-up commits in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Controller
The CLI surface for v0.41.15.0. Wires `gbrain sync --timeout <s>` (graceful
self-termination) and `gbrain sync --break-lock --all --max-age <s>`
(cron-self-heal) end-to-end through `performSync`, `runOne`, `runBreakLock`,
and all `SyncResult.status` consumers.
Surface 1: `gbrain sync --timeout <s>`
- New `SyncOpts.signal?: AbortSignal` threads through `performSync` →
`withRefreshingLock` work callback → `performSyncInner`.
- D-V3-1 honest scope: abort checks fire ONLY in pre-bookmark phases
(pull, delete, rename, import). Extract + embed run to completion if
reached. The `last_commit` bookmark write at sync.ts:1261 is the
invariant boundary — partial CANNOT advance the bookmark because the
abort checkpoints sit strictly before that write.
- D-V3-2 per-iteration: abort check at top of every loop iteration
(delete, rename, serial import, each parallel worker's while loop)
matches the per-file granularity the existing loops already have.
- D-V3-3 per-source AbortController: `--timeout --all` creates ONE
controller inside runOne per source so each gets its own budget;
NOT a shared global controller (which would starve later sources).
try/finally + timer.unref() guarantees cleanup on throw.
- D-V4-mech-7 pull error.cause: pullRepo wraps execFileSync errors in
GitOperationError. The catch inspects e.cause.code === 'ETIMEDOUT'
and e.cause.signal === 'SIGTERM' (NOT the top-level error) to
distinguish timeout (partial reason='pull_timeout') from ordinary
pull failure (existing warn-and-continue, R2 invariant preserved).
Surface 2: `gbrain sync --break-lock [--all] [--max-age <s>]`
- Drops the --all refusal at sync.ts:1610. When combined with --all,
runBreakLock iterates every active source and prints per-source verdict.
- --max-age routes through the new deleteLockRowIfStale helper from
db-lock.ts (atomic age-gated DELETE; no TOCTOU). Healthy refreshing
holders survive by construction; only wedged-but-alive holders trip.
D-V3-5 partial-status consumer threading (conservative posture matching
blocked_by_failures):
- printSyncResult: new `case 'partial':` arm reports filesImported +
reason; tells operator to re-run to continue.
- manageGitignore (both single-source and parallel runOne sites,
plus watch mode): excludes partial from the gate. A partial sync's
db_only path set isn't fully reconciled.
- Auto-embed-backfill enqueue inside runOne: excludes partial. The
next clean sync will re-walk and re-decide.
CLI flag parsing (T16):
- parseDurationSeconds in sync-concurrency.ts: accepts 60s/10m/1h/bare
int; rejects 0/negatives/decimals/garbage. Names the failing flag in
the error message.
- --timeout requires --source OR --all (validation rejects bare
`gbrain sync --timeout`).
- --max-age requires --break-lock; mutually exclusive with
--force-break-lock.
Coverage:
- 15 unit cases (test/sync-timeout.test.ts) pin parseDurationSeconds +
SyncResult union additivity.
- 2 E2E cases (test/e2e/sync-parallel.test.ts) pin the abort-mid-import
contract against real Postgres: status='partial', last_commit
unchanged, filesImported bounded.
Closes #1472 (RFC from @garrytan-agents) — CLI surface; schema foundation
landed in the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10K-page seed × 4 sources × deliberately tight --timeout × 3 sequential cron emulations. Asserts every source reaches `last_commit === HEAD` within 3 waves. Proves the v0.41.15.0 fix breaks the cascade the PR #1472 RFC documented. Workload (tests/heavy/_sync_timeout_rescue_workload.ts) is PGLite-only because the PGLite engine forces serial sync internally (parallelEligible excludes it). The parallel-fan-out + per-source AbortController case lives in test/e2e/sync-parallel.test.ts against real Postgres. This heavy test pins the contract that matters for cron: aborts → partial returns → next wave content_hash-short-circuits + makes new progress. Smoke-tested locally at PAGES=50 WAVES=2 TIMEOUT_SECONDS=2: every source converges within 2 waves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps VERSION + package.json to 0.41.15.0 (next slot after master's
v0.41.14.0). CHANGELOG entry leads ELI10 per gstack voice rules and
documents the 3 intentional honest gaps:
1. --timeout covers pull + delete + rename + import only; extract +
embed run to completion (D-V3-1 honest scope).
2. First 30 min after migration v98, --max-age cannot identify wedged
pre-upgrade holders (D-V4-1 rollout trade-off).
3. Full-sync triggers (first sync, --full, chunker-version rewalk)
don't respect --timeout yet (deferred to v0.42+).
README troubleshooting section: paste-ready cron pattern with shell
timeout(1) for OS-level process isolation + gbrain's --timeout for
graceful self-termination half-a-minute earlier.
TODOS.md: v0.42+ entries for subprocess fan-out (revisit if shell
timeout(1) proves insufficient), full-sync --timeout coverage via
AbortSignal in runImport, and runFactsBackstop microtask-queue
process-alive caveat.
llms-full.txt regenerated via `bun run build:llms`.
Closes #1472 (RFC from @garrytan-agents). Credit to @garrytan-agents
in the CHANGELOG for surfacing the production cron-failure data that
motivated the work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json # src/core/migrate.ts
scripts/check-test-isolation.sh greps for the literal string `mock.module(` to flag top-level module mocks (R2 rule — top-level mocks leak across files in the shard process). The regex doesn't know about comments, so my two new test files tripped the lint with JSDoc lines literally describing the rule: test/sync-timeout.test.ts:11 "* `mock.module()` (R2). Engine ..." test/sync-break-lock-all.test.ts:15 "* mock.module(), no process.env ..." Both files had ZERO actual mock.module() calls — only the comment text matched. Rewrote both JSDocs to refer to "top-level module mocks" instead of the literal token. Same meaning; doesn't trip the regex. `bun run check:test-isolation` now passes (714 non-serial unit files scanned). `bun run verify` clean (22/22 checks pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 27, 2026
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
May 28, 2026
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two new CLI flags + a documented shell cron pattern that breaks the hourly sync cascade documented in the closed PR #1472 RFC from @garrytan-agents (8/12 hourly cron runs timing out, sources going 50+ hours stale, manual
--break-lockper source for recovery).gbrain sync --source X --timeout <s>— graceful self-termination. Per-sourceAbortControllerinsiderunOneso--timeout --allgives each source its own budget. ReturnsSyncResult { status: 'partial', filesImported, reason: 'timeout' | 'pull_timeout' }withlast_commitUNCHANGED by construction (abort checks fire strictly before the bookmark write). CLI exits 0 so cron doesn't classify as failure.gbrain sync --break-lock --all --max-age <s>— drops the prior--allrefusal + adds age-gated breaking vialast_refreshed_atsemantic (notacquired_at). Healthy refreshing holders survive by construction; wedged-but-alive holders get correctly identified.timeout(1)per-source loop for OS-level process isolation + gbrain--timeoutfor graceful self-termination half-a-minute earlier.Honest scope (per the gstack voice rules)
Three deliberate gaps documented in CHANGELOG + failure-modes table:
--timeoutcovers pull + delete + rename + import only. Extract + embed run to completion (D-V3-1 — bookmark write is the boundary).--max-agecannot identify wedged pre-upgrade holders (D-V4-1 rollout-safety — backfill setslast_refreshed_at = NOW()to protect healthy old-binary holders).--full, chunker-version rewalk) don't respect--timeoutyet. Filed for v0.42+.Architecture provenance
Plan went through 1 eng review (10 decisions) + 3 codex passes (24 findings, all absorbed via 17 architectural + 7 mechanical fixes). Plan file at
~/.claude/plans/system-instruction-you-are-working-parallel-valley.md. Codex's load-bearing catches in the rollout:tryAcquireDbLock" path becausewithRefreshingLockkeepsttl_expires_atfresh but notacquired_at— would have stolen healthy long-running locks. Moved to the explicitrunBreakLockSQL only.--independentMinion fan-out —MinionWorkeris in-process worker pool, not OS-process-per-source;waitForCompletionthrows on timeout but doesn't cancel the underlying job. Shelltimeout(1)per-source is the right shape.last_refreshed_at = acquired_atin migration v98 — would let--max-age 1800immediately delete healthy pre-upgrade holders during the rollout window. Backfilling= NOW()trades 30 min of degraded recovery for zero-risk rollout.Commit chain (bisect-friendly)
3eee3817feat(sync): migration v98 last_refreshed_at + deleteLockRowIfStale helpercd1c0e8afeat(sync): --timeout + --max-age + partial status + per-source AbortController8b342120test(heavy): sync_timeout_rescue.sh reproducer for the cron-cascadea9da5067docs(v0.41.15.0): CHANGELOG + README + TODOS + version bump7b9bda96Merge origin/master (catch-up from v0.41.12 → v0.41.14)Test plan
bun run typecheckcleanbun run test→ 11,087 pass / 0 fail across 4 shards (553-1039s per shard, 18 minutes total)bun run test:serial→ 508 pass / 0 fail across 46 filestest/sync-break-lock-all.test.ts): tryAcquireDbLock writeslast_refreshed_aton INSERT, refresh bumps both columns, inspectLock surfaces the new field, deleteLockRowIfStale refuses fresh / breaks stale / safe on holder_pid mismatch / refuses NULL (pre-v98)test/sync-timeout.test.ts): parseDurationSeconds full grammar (60s/10m/1h/bare int; reject 0/neg/decimals/garbage), partial JSON envelope additivitytest/e2e/sync-parallel.test.ts): real Postgres, abort-mid-import returns partial,last_commitunchanged,filesImportedboundedtests/heavy/sync_timeout_rescue.sh): smoke-tested at PAGES=50 × 4 sources × WAVES=2 × TIMEOUT=2s → every source converges within 2 waves; cascade is brokengbrain --versionreports0.41.15.0from compiled binary0.41.15.0🤖 Generated with Claude Code