v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes #1472 RFC)#1506

Merged
garrytan merged 6 commits into master from garrytan/beirut-v4
May 26, 2026

Conversation

@garrytan

Owner

Summary

Two new CLI flags plus a recommended shell cron pattern that together break the hourly sync cascade described in the closed PR #1472 RFC from @garrytan-agents (8 of 12 hourly cron runs timing out, sources going 50+ hours stale, and a manual --break-lock per source needed for recovery).

  • gbrain sync --source X --timeout <s> — graceful self-termination. Per-source AbortController inside runOne so --timeout --all gives each source its own budget. Returns SyncResult { status: 'partial', filesImported, reason: 'timeout' | 'pull_timeout' } with last_commit UNCHANGED by construction (abort checks fire strictly before the bookmark write). CLI exits 0 so cron doesn't classify the run as a failure.
  • gbrain sync --break-lock --all --max-age <s> — drops the prior --all refusal and adds age-gated breaking keyed on last_refreshed_at (not acquired_at). Healthy refreshing holders survive by construction; wedged-but-alive holders get correctly identified.
  • Recommended cron pattern (README + CHANGELOG): shell timeout(1) per-source loop for OS-level process isolation + gbrain --timeout for graceful self-termination half-a-minute earlier.
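The per-source budget from the first bullet can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `runOneWithBudget`, `runAll`, `Work`, and `Outcome` are hypothetical names, the loop is sequential for simplicity, and the real `runOne` does considerably more.

```typescript
// Hypothetical result shape; the real SyncResult carries more fields.
type Outcome = { source: string; status: "synced" | "partial" };

// Stand-in for the per-source sync body; it must poll signal.aborted itself.
type Work = (source: string, signal: AbortSignal) => Promise<"synced" | "partial">;

async function runOneWithBudget(source: string, budgetMs: number, work: Work): Promise<Outcome> {
  const controller = new AbortController(); // created per source, never shared
  const timer = setTimeout(() => controller.abort(), budgetMs);
  // On Node and Bun, unref() stops a pending timer from keeping the process alive.
  (timer as unknown as { unref?: () => void }).unref?.();
  try {
    return { source, status: await work(source, controller.signal) };
  } finally {
    clearTimeout(timer); // cleanup runs on both return and throw
  }
}

async function runAll(sources: string[], budgetMs: number, work: Work): Promise<Outcome[]> {
  const results: Outcome[] = [];
  for (const source of sources) {
    // Each iteration starts a fresh clock, so a slow source cannot starve later ones.
    results.push(await runOneWithBudget(source, budgetMs, work));
  }
  return results;
}
```

With a single shared controller, the first slow source would consume the whole budget and every later source would start already aborted; one controller per source avoids that.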

Honest scope (per the gstack voice rules)

Three deliberate gaps documented in CHANGELOG + failure-modes table:

  1. --timeout covers pull + delete + rename + import only. Extract + embed run to completion (D-V3-1 — bookmark write is the boundary).
  2. First 30 min after migration v98, --max-age cannot identify wedged pre-upgrade holders (D-V4-1 rollout-safety — backfill sets last_refreshed_at = NOW() to protect healthy old-binary holders).
  3. Full-sync triggers (first sync, --full, chunker-version rewalk) don't respect --timeout yet. Filed for v0.42+.
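Gap 1 follows from where the abort checkpoints sit relative to the bookmark write. A simplified sketch of that ordering, assuming a single serial import loop and hypothetical names (`importUntilAborted`, `state.lastCommit`); the real `performSyncInner` also covers pull, delete, and rename:

```typescript
type SyncResult =
  | { status: "synced"; filesImported: number }
  | { status: "partial"; filesImported: number; reason: "timeout" };

// `signal` is typed structurally so a real AbortSignal or a plain object both fit.
function importUntilAborted(
  files: string[],
  signal: { aborted: boolean },
  state: { lastCommit: string },
  head: string,
  importFile: (path: string) => void,
): SyncResult {
  let filesImported = 0;
  for (const path of files) {
    // Checkpoint at the top of every iteration, before any work on this file.
    if (signal.aborted) {
      return { status: "partial", filesImported, reason: "timeout" };
    }
    importFile(path);
    filesImported++;
  }
  // The bookmark write is reachable only when the loop ran to completion,
  // so a partial return can never advance it.
  state.lastCommit = head;
  return { status: "synced", filesImported };
}
```

An abort that lands after the last checkpoint is simply not observed: the sync finishes and the bookmark advances, which matches the "run to completion if reached" wording for the later phases.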

Architecture provenance

Plan went through 1 eng review (10 decisions) + 3 codex passes (24 findings, all absorbed via 17 architectural + 7 mechanical fixes). Plan file at ~/.claude/plans/system-instruction-you-are-working-parallel-valley.md. Codex's load-bearing catches in the rollout:

  • CMT1: rejected the v1 "add age-based steal to routine tryAcquireDbLock" path because withRefreshingLock keeps ttl_expires_at fresh but not acquired_at — would have stolen healthy long-running locks. Moved to the explicit runBreakLock SQL only.
  • CMT2: rejected --independent Minion fan-out — MinionWorker is in-process worker pool, not OS-process-per-source; waitForCompletion throws on timeout but doesn't cancel the underlying job. Shell timeout(1) per-source is the right shape.
  • D-V4-1: rejected backfilling last_refreshed_at = acquired_at in migration v98 — would let --max-age 1800 immediately delete healthy pre-upgrade holders during the rollout window. Backfilling = NOW() trades 30 min of degraded recovery for zero-risk rollout.

Commit chain (bisect-friendly)

  1. 3eee3817 feat(sync): migration v98 last_refreshed_at + deleteLockRowIfStale helper
  2. cd1c0e8a feat(sync): --timeout + --max-age + partial status + per-source AbortController
  3. 8b342120 test(heavy): sync_timeout_rescue.sh reproducer for the cron-cascade
  4. a9da5067 docs(v0.41.15.0): CHANGELOG + README + TODOS + version bump
  5. 7b9bda96 Merge origin/master (catch-up from v0.41.12 → v0.41.14)

Test plan

  • bun run typecheck clean
  • bun run test: 11,087 pass / 0 fail across 4 shards (553-1039s per shard, 18 minutes total)
  • bun run test:serial: 508 pass / 0 fail across 46 files
  • PGLite hermetic suite (13 cases in test/sync-break-lock-all.test.ts): tryAcquireDbLock writes last_refreshed_at on INSERT, refresh bumps both columns, inspectLock surfaces the new field, deleteLockRowIfStale refuses fresh / breaks stale / safe on holder_pid mismatch / refuses NULL (pre-v98)
  • Unit cases (15 cases in test/sync-timeout.test.ts): parseDurationSeconds full grammar (60s/10m/1h/bare int; reject 0/neg/decimals/garbage), partial JSON envelope additivity
  • E2E case extension (test/e2e/sync-parallel.test.ts): real Postgres, abort-mid-import returns partial, last_commit unchanged, filesImported bounded
  • Heavy reproducer (tests/heavy/sync_timeout_rescue.sh): smoke-tested at PAGES=50 × 4 sources × WAVES=2 × TIMEOUT=2s → every source converges within 2 waves; cascade is broken
  • gbrain --version reports 0.41.15.0 from compiled binary
  • CI version-gate trio: VERSION + package.json + CHANGELOG header all read 0.41.15.0
  • Local smoke against the CEO-class brain (run after merge — operator step, not CI)

🤖 Generated with Claude Code

garrytan and others added 5 commits May 26, 2026 08:29
…lper

Schema foundation for v0.41.15.0's `gbrain sync --break-lock --max-age <s>`
flag. Adds `gbrain_cycle_locks.last_refreshed_at TIMESTAMPTZ` as the
heartbeat signal that distinguishes wedged-but-alive lock holders from
healthy long-running syncs that are actively refreshing.

Why last_refreshed_at not acquired_at: `withRefreshingLock` already bumps
`ttl_expires_at` every ~5 min while work runs, but leaves `acquired_at` at
the original timestamp. A 35-min media-corpus sync that's healthy has
`acquired_at` 35 min ago but `last_refreshed_at` 30 seconds ago. Using
acquired_at for --max-age would steal healthy locks; last_refreshed_at
correctly identifies only holders whose JS interval has stopped firing.
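A worked version of the 35-minute example, under the assumption that the age gate is a plain "older than N seconds" comparison. `isOlderThan` is a hypothetical helper for illustration only; the real check runs in SQL.

```typescript
// True when `ts` is more than maxAgeSeconds before `now`.
// A null timestamp (a pre-v98 row) is never treated as stale.
function isOlderThan(ts: Date | null, now: Date, maxAgeSeconds: number): boolean {
  if (ts === null) return false;
  return now.getTime() - ts.getTime() > maxAgeSeconds * 1000;
}

const now = new Date("2026-05-26T12:00:00Z");
const minutesAgo = (m: number) => new Date(now.getTime() - m * 60_000);

// Healthy 35-minute sync: acquired 35 min ago, heartbeat 30 s ago.
const healthy = { acquiredAt: minutesAgo(35), lastRefreshedAt: minutesAgo(0.5) };
// Wedged holder: same acquisition time, but the heartbeat stopped 35 min ago.
const wedged = { acquiredAt: minutesAgo(35), lastRefreshedAt: minutesAgo(35) };

const maxAge = 1800; // --max-age 1800

const stolenByAcquiredAt = isOlderThan(healthy.acquiredAt, now, maxAge);          // true: healthy lock would be stolen
const healthyByHeartbeat = isOlderThan(healthy.lastRefreshedAt, now, maxAge);     // false: healthy lock survives
const wedgedByHeartbeat = isOlderThan(wedged.lastRefreshedAt, now, maxAge);       // true: wedged lock is broken
```

Only the heartbeat column separates the two holders; `acquired_at` is identical for both.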

D-V4-1 rollout safety: migration v98 backfills `last_refreshed_at = NOW()`
(NOT `= acquired_at`) so pre-upgrade holders running the old binary get a
30-min protection window. After that window all pre-upgrade syncs are
either complete (lock released) OR genuinely wedged (--max-age does the
right thing). Documented as a known caveat in CHANGELOG.

D-V4-mech-4 SQL cast: deleteLockRowIfStale uses `$N * INTERVAL '1 second'`
not `$N::interval` (Postgres does not cast integer to interval the latter
way). Atomic DELETE keyed on (id, holder_pid, last_refreshed_at < NOW() -
$N * INTERVAL '1 second') RETURNING id, last_refreshed_at — no TOCTOU
between inspect + delete.
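One plausible shape for that statement, reconstructed from the description above. The table and column names come from this commit message; the function name `buildDeleteIfStale`, the parameter types, and the exact statement layout are assumptions, and the real helper in db-lock.ts may differ.

```typescript
interface Query {
  text: string;
  values: (string | number)[];
}

// Builds a single atomic DELETE: the age test is part of the WHERE clause,
// so there is no separate inspect step that could race with the delete.
function buildDeleteIfStale(lockId: string, holderPid: number, maxAgeSeconds: number): Query {
  return {
    text:
      "DELETE FROM gbrain_cycle_locks " +
      "WHERE id = $1 AND holder_pid = $2 " +
      // Multiply by a one-second interval instead of casting the integer.
      "AND last_refreshed_at < NOW() - $3 * INTERVAL '1 second' " +
      "RETURNING id, last_refreshed_at",
    values: [lockId, holderPid, maxAgeSeconds],
  };
}
```

Because a comparison against NULL is never true in SQL, a row whose `last_refreshed_at` is NULL does not match this WHERE clause, which is consistent with the "refuses NULL (pre-v98)" test case.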

D-V4-mech-3 schema-snapshot parity: column added to all 3 snapshots so
fresh init paths (pglite-schema.ts, schema.sql) initialize correctly
without depending on the migration runner. schema-embedded.ts regenerated
via `bun run build:schema`.

Pinned by 13 PGLite cases in test/sync-break-lock-all.test.ts:
tryAcquireDbLock writes on INSERT, withRefreshingLock refresh bumps both
columns, inspectLock surfaces the new field, deleteLockRowIfStale refuses
fresh / breaks stale / safe on holder_pid mismatch / refuses NULL
(pre-v98). R1 + R6 regression invariants from the v4 plan.

Closes #1472 (RFC from @garrytan-agents) — schema foundation only;
performSync abort threading + CLI flags + consumer threading land in
follow-up commits in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Controller

The CLI surface for v0.41.15.0. Wires `gbrain sync --timeout <s>` (graceful
self-termination) and `gbrain sync --break-lock --all --max-age <s>`
(cron-self-heal) end-to-end through `performSync`, `runOne`, `runBreakLock`,
and all `SyncResult.status` consumers.

Surface 1: `gbrain sync --timeout <s>`
  - New `SyncOpts.signal?: AbortSignal` threads through `performSync` →
    `withRefreshingLock` work callback → `performSyncInner`.
  - D-V3-1 honest scope: abort checks fire ONLY in pre-bookmark phases
    (pull, delete, rename, import). Extract + embed run to completion if
    reached. The `last_commit` bookmark write at sync.ts:1261 is the
    invariant boundary — partial CANNOT advance the bookmark because the
    abort checkpoints sit strictly before that write.
  - D-V3-2 per-iteration: abort check at top of every loop iteration
    (delete, rename, serial import, each parallel worker's while loop)
    matches the per-file granularity the existing loops already have.
  - D-V3-3 per-source AbortController: `--timeout --all` creates ONE
    controller inside runOne per source so each gets its own budget;
    NOT a shared global controller (which would starve later sources).
    try/finally + timer.unref() guarantees cleanup on throw.
  - D-V4-mech-7 pull error.cause: pullRepo wraps execFileSync errors in
    GitOperationError. The catch inspects e.cause.code === 'ETIMEDOUT'
    and e.cause.signal === 'SIGTERM' (NOT the top-level error) to
    distinguish timeout (partial reason='pull_timeout') from ordinary
    pull failure (existing warn-and-continue, R2 invariant preserved).
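The D-V4-mech-7 check can be illustrated as follows. `classifyPullError` is a hypothetical name, and requiring both fields to match is an assumption about how the two conditions combine; the point is only that the classifier reads the wrapped `cause`, not the wrapper.

```typescript
type PullFailure = "pull_timeout" | "pull_error";

// Looks only at err.cause, because the wrapper (GitOperationError in the PR)
// does not carry the child-process fields itself.
function classifyPullError(err: unknown): PullFailure {
  const cause = (err as { cause?: { code?: string; signal?: string } } | null | undefined)?.cause;
  const timedOut = cause?.code === "ETIMEDOUT" && cause?.signal === "SIGTERM";
  return timedOut ? "pull_timeout" : "pull_error";
}

// A wrapped timeout: the fields live on the cause.
const wrappedTimeout = Object.assign(new Error("git pull failed"), {
  cause: { code: "ETIMEDOUT", signal: "SIGTERM" },
});
// An ordinary failure: no timeout fields on the cause.
const wrappedOther = Object.assign(new Error("git pull failed"), {
  cause: { code: "ENOENT" },
});

const timeoutKind = classifyPullError(wrappedTimeout); // "pull_timeout"
const otherKind = classifyPullError(wrappedOther);     // "pull_error"
```

An ordinary failure stays on the existing warn-and-continue path; only the timeout case maps to a partial result with reason 'pull_timeout'.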

Surface 2: `gbrain sync --break-lock [--all] [--max-age <s>]`
  - Drops the --all refusal at sync.ts:1610. When combined with --all,
    runBreakLock iterates every active source and prints per-source verdict.
  - --max-age routes through the new deleteLockRowIfStale helper from
    db-lock.ts (atomic age-gated DELETE; no TOCTOU). Healthy refreshing
    holders survive by construction; only wedged-but-alive holders trip.

D-V3-5 partial-status consumer threading (conservative posture matching
blocked_by_failures):
  - printSyncResult: new `case 'partial':` arm reports filesImported +
    reason; tells operator to re-run to continue.
  - manageGitignore (both single-source and parallel runOne sites,
    plus watch mode): excludes partial from the gate. A partial sync's
    db_only path set isn't fully reconciled.
  - Auto-embed-backfill enqueue inside runOne: excludes partial. The
    next clean sync will re-walk and re-decide.
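A compact way to picture the consumer threading is a discriminated union with one conservative gate. The `'partial'` and `'blocked_by_failures'` statuses come from this PR; `'synced'` as the name of the clean case, and the helper names `describeResult` and `mayRunPostSyncSteps`, are assumptions for illustration.

```typescript
type SyncResult =
  | { status: "synced"; filesImported: number } // hypothetical name for the clean case
  | { status: "blocked_by_failures"; filesImported: number }
  | { status: "partial"; filesImported: number; reason: "timeout" | "pull_timeout" };

// Mirrors the new `case 'partial':` arm: report progress and tell the operator to re-run.
function describeResult(r: SyncResult): string {
  switch (r.status) {
    case "synced":
      return `synced ${r.filesImported} files`;
    case "blocked_by_failures":
      return `blocked by failures after ${r.filesImported} files`;
    case "partial":
      return `partial (${r.reason}): ${r.filesImported} files imported; re-run to continue`;
  }
}

// Conservative gate: gitignore management and embed-backfill enqueue
// run only after a fully reconciled sync.
function mayRunPostSyncSteps(r: SyncResult): boolean {
  return r.status === "synced";
}
```

Adding the new status as a union member means an exhaustive `switch` without a `'partial'` arm fails type checking, which is one way to find every consumer that needs a decision.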

CLI flag parsing (T16):
  - parseDurationSeconds in sync-concurrency.ts: accepts 60s/10m/1h/bare
    int; rejects 0/negatives/decimals/garbage. Names the failing flag in
    the error message.
  - --timeout requires --source OR --all (validation rejects bare
    `gbrain sync --timeout`).
  - --max-age requires --break-lock; mutually exclusive with
    --force-break-lock.
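The duration grammar above is small enough to sketch in full. This reuses the name `parseDurationSeconds` from the commit message, but the signature and error wording here are assumptions, not the code in sync-concurrency.ts.

```typescript
// Accepts "60s", "10m", "1h", or a bare integer number of seconds.
// Rejects zero, negatives, decimals, and anything else, naming the flag in the error.
function parseDurationSeconds(raw: string, flag: string): number {
  const match = /^(\d+)([smh]?)$/.exec(raw);
  if (!match) {
    throw new Error(`${flag}: expected a duration like 60s, 10m, 1h, or a bare integer; got "${raw}"`);
  }
  const unitSeconds = match[2] === "m" ? 60 : match[2] === "h" ? 3600 : 1;
  const seconds = Number(match[1]) * unitSeconds;
  if (!Number.isSafeInteger(seconds) || seconds <= 0) {
    throw new Error(`${flag}: duration must be a positive whole number of seconds; got "${raw}"`);
  }
  return seconds;
}
```

The anchored pattern rejects a leading minus sign and a decimal point before any arithmetic happens, so only the zero case needs an explicit range check.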

Coverage:
  - 15 unit cases (test/sync-timeout.test.ts) pin parseDurationSeconds +
    SyncResult union additivity.
  - 2 E2E cases (test/e2e/sync-parallel.test.ts) pin the abort-mid-import
    contract against real Postgres: status='partial', last_commit
    unchanged, filesImported bounded.

Closes #1472 (RFC from @garrytan-agents) — CLI surface; schema foundation
landed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

10K-page seed × 4 sources × deliberately tight --timeout × 3 sequential
cron emulations. Asserts every source reaches `last_commit === HEAD`
within 3 waves. Proves the v0.41.15.0 fix breaks the cascade the
PR #1472 RFC documented.

Workload (tests/heavy/_sync_timeout_rescue_workload.ts) is PGLite-only
because the PGLite engine forces serial sync internally (parallelEligible
excludes it). The parallel-fan-out + per-source AbortController case
lives in test/e2e/sync-parallel.test.ts against real Postgres. This
heavy test pins the contract that matters for cron: aborts → partial
returns → next wave content_hash-short-circuits + makes new progress.

Smoke-tested locally at PAGES=50 WAVES=2 TIMEOUT_SECONDS=2: every
source converges within 2 waves.
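The convergence argument can be modeled with a toy simulation. This is not the heavy test's workload: `runWaves` is hypothetical, the timeout is modeled as a fixed count of new files per wave, and the content_hash short-circuit is modeled as a free skip of already-imported files.

```typescript
// Each wave re-walks every file, skips already-imported ones at no cost,
// and aborts once its budget of new imports is spent.
function runWaves(
  totalFiles: number,
  newFilesPerWave: number,
  maxWaves: number,
): { converged: boolean; waves: number } {
  const imported = new Set<number>(); // stands in for content_hash matches
  for (let wave = 1; wave <= maxWaves; wave++) {
    let budget = newFilesPerWave;
    let aborted = false;
    for (let file = 0; file < totalFiles; file++) {
      if (imported.has(file)) continue; // short-circuit: nothing to redo
      if (budget === 0) {
        aborted = true; // partial return; the bookmark stays put
        break;
      }
      imported.add(file);
      budget--;
    }
    if (!aborted) return { converged: true, waves: wave };
  }
  return { converged: false, waves: maxWaves };
}

const twoWaves = runWaves(50, 30, 3); // { converged: true, waves: 2 }
const tooTight = runWaves(50, 10, 3); // { converged: false, waves: 3 }
```

The cascade is broken when every wave makes forward progress that the next wave can skip; the model shows that convergence then depends only on how much new work fits in each budget.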

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bumps VERSION + package.json to 0.41.15.0 (next slot after master's
v0.41.14.0). CHANGELOG entry leads ELI10 per gstack voice rules and
documents the 3 intentional honest gaps:
  1. --timeout covers pull + delete + rename + import only; extract +
     embed run to completion (D-V3-1 honest scope).
  2. First 30 min after migration v98, --max-age cannot identify wedged
     pre-upgrade holders (D-V4-1 rollout trade-off).
  3. Full-sync triggers (first sync, --full, chunker-version rewalk)
     don't respect --timeout yet (deferred to v0.42+).

README troubleshooting section: paste-ready cron pattern with shell
timeout(1) for OS-level process isolation + gbrain's --timeout for
graceful self-termination half-a-minute earlier.

TODOS.md: v0.42+ entries for subprocess fan-out (revisit if shell
timeout(1) proves insufficient), full-sync --timeout coverage via
AbortSignal in runImport, and runFactsBackstop microtask-queue
process-alive caveat.

llms-full.txt regenerated via `bun run build:llms`.

Closes #1472 (RFC from @garrytan-agents). Credit to @garrytan-agents
in the CHANGELOG for surfacing the production cron-failure data that
motivated the work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts:
#	CHANGELOG.md
#	TODOS.md
#	VERSION
#	package.json
#	src/core/migrate.ts

scripts/check-test-isolation.sh greps for the literal string `mock.module(`
to flag top-level module mocks (R2 rule — top-level mocks leak across files
in the shard process). The regex doesn't know about comments, so my two new
test files tripped the lint with JSDoc lines literally describing the rule:

  test/sync-timeout.test.ts:11   "* `mock.module()` (R2). Engine ..."
  test/sync-break-lock-all.test.ts:15  "* mock.module(), no process.env ..."

Both files had ZERO actual mock.module() calls — only the comment text
matched. Rewrote both JSDocs to refer to "top-level module mocks" instead
of the literal token. Same meaning; doesn't trip the regex.
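The false positive is easy to reproduce with a literal substring scan. This TypeScript stand-in is only an illustration of the behavior described above; the real check is a shell script, and `findLiteralHits` is a hypothetical name.

```typescript
const NEEDLE = "mock.module(";

// Returns the 1-based line numbers containing the literal token,
// with no awareness of comments or strings.
function findLiteralHits(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (line.includes(NEEDLE)) hits.push(index + 1);
  });
  return hits;
}

// A JSDoc comment that merely describes the rule still matches.
const before = ["/**", " * No mock.module() calls in this file.", " */", "export {};"].join("\n");
// The reworded comment carries the same meaning without the token.
const after = ["/**", " * No top-level module mocks in this file.", " */", "export {};"].join("\n");

const hitsBefore = findLiteralHits(before); // [2]
const hitsAfter = findLiteralHits(after);   // []
```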

`bun run check:test-isolation` now passes (714 non-serial unit files
scanned). `bun run verify` clean (22/22 checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit cd8efee into master May 26, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)

1 participant