Skip to content

v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes #1567)#1572

Merged
garrytan merged 3 commits into
masterfrom
garrytan/lock-renewal-crash-fix
May 28, 2026
Merged

v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes #1567)#1572
garrytan merged 3 commits into
masterfrom
garrytan/lock-renewal-crash-fix

Conversation

@garrytan

Copy link
Copy Markdown
Owner

Summary

  • Closes the production worker crash class that was dropping ~39 daemons/day on Supabase + PgBouncer with unhandledRejection at renewLock.
  • Closes a second unhandledRejection vector the inside-review missed: the stored executeJob(...).finally(...) promise had no .catch(), so failJob/completeJob throws during the same DB outage would crash the daemon too.
  • Privacy backfill for audit JSONL — new shared redactConnectionInfo helper strips Postgres URLs/host/user/password/IPv4 from error messages before they hit disk. Wired into both new lock-renewal-audit AND existing batch-retry-audit (same risk class).

Why this supersedes #1567

PR #1567 (contributor @garrytan-agents) proposed the right try/catch shape but was missing:

  • Re-entrancy guard (overlapping ticks during PgBouncer stalls would pile concurrent connection acquisitions).
  • Per-call timeout (hung renewLock would wedge the worker forever).
  • Time-based abort (count-based 3-strike at lockDuration=30s let other workers reclaim BEFORE we aborted — concurrent execution risk).
  • Infrastructure-reason guard (PgBouncer blips would dead-letter healthy jobs).
  • Universal grace-eviction (the 30s force-evict safety net only fired for job.timeout_ms, not lock-renewal aborts).
  • Privacy redactor (audit JSONL leaked credentials).
  • Tests (zero coverage on the regression class).
  • Bundled unrelated work (safeSliceEnd UTF-16 surrogate fix — different bug class, filed as TODO-LR-1).

The plan-eng-review surfaced 4 of these gaps inside, and codex outside-voice surfaced the other 4. All 8 absorbed via 9 locked design decisions (D1-D9) captured in the plan file.

Architecture

Pure-function extraction is the load-bearing change:

  • src/core/minions/lock-renewal-tick.tsrunLockRenewalTick(deps, state) returns a tagged union ('ok' | 'cancelled' | 'lock_lost' | 'should_abort'). All side effects (clearInterval, abort.abort, finally cleanup) stay in the worker's closure scope. Tests drive the state machine with fake deps — no setInterval plumbing, no PGLite, no mock.module.
  • worker.ts:launchJob reduces to a thin setInterval(() => void runLockRenewalTick(...)) sync wrapper.
  • scripts/check-worker-lock-renewal-shape.sh CI guard wires into bun run verify. Asserts the bug pattern (lockTimer = setInterval(async ...)) stays absent AND the call site survives refactors. Narrowed via lockTimer = prefix so unrelated setInterval(async) calls elsewhere in the file (e.g. stall detector — codex C13, filed separately as TODO-LR-4) don't false-fire.

Test plan

  • 18 hermetic pure-function tests on runLockRenewalTick (state machine paths, cancellation, time-based abort, hung-renewLock timeout, audit defense-in-depth, env-knob resolution)
  • 11 audit module tests (4-outcome contract, redactor wired, readback, corrupted-line tolerance, pruning)
  • 15 redactor unit tests (all 5 patterns, IPv4-vs-version-number defense, idempotency, real Supabase/ENOTFOUND fixtures)
  • 3 privacy backfill regression tests on batch-retry-audit (snapshot JSONL — no DSN/host/password)
  • 5 CI guard meta-tests (good shape passes, bug pattern fails, missing call-site fails, false-positive defense for stall-detector pattern elsewhere)
  • 12 existing batch-retry audit tests still pass (privacy backfill non-breaking)
  • 169 existing minion tests still pass
  • 13 existing worker-shutdown / worker-rss tests still pass
  • bun run verify green (29 checks including the new shape guard)

Total: 64 new cases + 182 existing tests verified. All hermetic.

Test failures in the parallel suite (longmemeval timeouts, schema-pack timeouts, bootstrap PGLite cold-start timeouts, query-cache flakes, eval-prune timeouts) are pre-existing infrastructure-flake under heavy load — none touch any code in this PR.

Operator verification after upgrade

# Confirm the shape guard is wired.
bun run check:worker-lock-renewal-shape  # → "lock-renewal shape OK"

# Watch the audit channel during a real Supabase reconnect.
tail -F ~/.gbrain/audit/lock-renewal-*.jsonl

# Force a blip if you can:
docker exec your-pgbouncer-container pkill -HUP pgbouncer
# Expected: 1-2 `failure` events then `success_after_failure` on recovery.
# Worker process MUST NOT exit. If it does, file an issue.

Closes

🤖 Generated with Claude Code

…y (supersedes #1567)

Production worker daemons against Supabase / PgBouncer were crashing
~39 times/day with `unhandledRejection at renewLock`. PR #1567
proposed the right try/catch shape; this wave incorporates it and
closes the entire bug class (4 inside-review + 8 outside-voice
findings absorbed via 9 locked design decisions).

What's fixed:

- `setInterval(async () => await renewLock(...))` replaced with a
  sync wrapper around the new pure `runLockRenewalTick` function.
  No more unhandled rejections escaping the timer callback.
- Second crash vector closed: `.catch()` on the stored
  `executeJob(...).finally(...)` promise so failJob/completeJob
  throws during the same outage can't propagate to
  `process.on('unhandledRejection')`.
- Per-call `Promise.race` timeout (default `lockDuration/3`) bounds
  hung renewLock calls so the re-entrancy guard can't wedge
  indefinitely.
- Time-based abort (NOT count-based) so the worker releases its
  lock BEFORE another worker can reclaim. With the prior 3-strike
  count + 30s lockDuration, a 15s window let other workers race.
- Infrastructure aborts (`lock-renewal-failed`, `lock-lost`) don't
  burn job attempts — `executeJob`'s catch consults the exported
  `INFRASTRUCTURE_ABORT_REASONS` set and skips `failJob` so the
  stall detector reclaims cleanly.
- Universal grace-eviction: 30s force-evict safety net now fires
  for ANY abort reason, not just `job.timeout_ms`.

What's added:

- `src/core/minions/lock-renewal-tick.ts` (NEW): pure extracted
  state-machine function + env-knob resolver. Three operator-tunable
  knobs via env (max-failures-for-audit, call-timeout-ms,
  safety-margin-ms) with stderr-warn-once on bad input + default
  fallback.
- `src/core/audit/lock-renewal-audit.ts` (NEW): sibling of
  `batch-retry-audit.ts`. Four outcomes: failure /
  success_after_failure / gave_up / executeJob_rejected. JSONL at
  `~/.gbrain/audit/lock-renewal-YYYY-Www.jsonl`.
- `src/core/audit/redact-connection-info.ts` (NEW): shared privacy
  helper. Strips Postgres URLs, host=, user=, password=, IPv4 from
  error messages before they hit audit JSONL. Wired into BOTH the
  new lock-renewal audit AND the existing batch-retry audit
  (privacy backfill — same risk class).
- `scripts/check-worker-lock-renewal-shape.sh` (NEW): CI guard
  wired into `bun run verify`. Asserts the v0.41.22.1 bug pattern
  (`lockTimer = setInterval(async ...)`) stays absent AND the pure
  function call site survives refactors. Bug-pattern-specific so it
  doesn't fight legitimate refactors (codex C12).

Tests: 64 new cases across 5 new test files. 182 existing minion +
worker tests still pass. All hermetic — no PGLite, no real network,
no `mock.module`.

Plan + 9 decisions + codex outside-voice review at
~/.claude/plans/system-instruction-you-are-working-humming-nygaard.md

Closes #1567 (incorporates the contributor's try/catch shape; closes
the bug class structurally).

Co-Authored-By: @garrytan-agents <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original v0.41.26.1 wave shipped 64 hermetic unit tests on the
pure tick function, audit primitives, and redactor. Post-ship audit
flagged 8 wiring gaps the pure tests can't see — A (launchJob
wiring), B (executeJob skip-failJob), C (.catch on stored promise),
D (INFRASTRUCTURE_ABORT_REASONS export), E (universal grace-evict),
F (executeJob_rejected end-to-end), G (re-entrancy guard at worker
layer), H (gold-standard E2E regression).

Now closed:

- **test/worker-lock-renewal-e2e.serial.test.ts** (1 test, gap H):
  the headline gold-standard regression. Real PGLite + real
  MinionWorker + executeRaw wrap that injects renewLock failures on
  demand. Pins that the worker process DOES NOT crash via
  unhandledRejection under sustained renewLock throws, the handler
  observes abort.signal.aborted = true with reason
  'lock-renewal-failed', and the audit JSONL contains both `failure`
  and `gave_up` events. The exact v0.41.22.1 production bug class.
  Quarantined to its own file because bun:test serial + PGLite has an
  unresolved interaction with multiple MinionWorker-driven tests in
  the same file (second test's queue.add hangs indefinitely).

- **test/worker-lock-renewal-shape.test.ts** (18 tests, gaps A-G):
  source-shape behavioral pins. Greps worker.ts function bodies for
  the patterns the locked decisions promised: launchJob calls
  runLockRenewalTick + resolveLockRenewalKnobs + uses
  lockRenewalAudit; tickInFlight declared and gated correctly; stored
  executeJob promise has .catch with logExecuteJobRejected + console
  stderr; abort.signal.addEventListener fires for any abort (not just
  timeout_ms); INFRASTRUCTURE_ABORT_REASONS used inside executeJob's
  catch with return-early shape. Bug-pattern-specific so a refactor
  that genuinely improves the shape passes; a refactor that
  accidentally strips a guarantee fails loud.

All 83 lock-renewal wave tests pass in 5.2s. 205 existing minion +
worker tests still green. No production code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan

Copy link
Copy Markdown
Owner Author

Test coverage gap-fill landed in a8b282d (closes the post-ship gap audit covering A-H).

Two new test files, 19 new cases, 5.2s wallclock:

  • test/worker-lock-renewal-e2e.serial.test.ts (1 test, gap H — gold-standard regression): real PGLite + real MinionWorker + executeRaw wrap that injects renewLock failures. Pins the headline behavioral contract: worker process MUST NOT crash via unhandledRejection under sustained renewLock throws (the exact v0.41.22.1 production bug class), the handler observes abort.signal.aborted with reason lock-renewal-failed, and the audit JSONL contains both failure and gave_up events.

  • test/worker-lock-renewal-shape.test.ts (18 tests, gaps A, B, C, D, E, F, G): source-shape behavioral pins. Greps worker.ts function bodies for the patterns the locked decisions promised — launchJob calls runLockRenewalTick + resolveLockRenewalKnobs + lockRenewalAudit; tickInFlight declared and gated; stored executeJob promise has .catch with logExecuteJobRejected + console stderr; abort.signal.addEventListener('abort', ...) fires for any abort reason; INFRASTRUCTURE_ABORT_REASONS consulted inside executeJob's catch with return-early shape. Bug-pattern-specific by design.

Why E2E sits in its own single-test file: bun:test's serial runner has an unresolved interaction with PGLite when multiple MinionWorker-driven tests share a file (the second test's queue.add hangs indefinitely even with cleanly reset state). Isolating H + using source-shape pins for the wiring (which are stronger regression guards anyway — they survive refactors that change implementation shape but preserve the contract) is the cleanest split.

Total wave coverage: 83 cases across 8 files, all hermetic, all pass in 5.2s. 205 existing minion + worker tests still green.

CI verify failed on the new E2E gap-fill test (commit a8b282d) due
to two TS errors that bun's runtime accepts but tsc rejects:

1. line 92: `originalExecuteRaw(sql, ...args)` with `args: unknown[]`
   — TS can't prove args has <=2 elements, so the call site looks
   like 1+1+N args against a function that accepts 1-3. Fixed by
   destructuring the wrap params explicitly: `(sql, params?, opts?)`
   matching the executeRaw signature, then calling
   `originalExecuteRaw(sql, params, opts)` with named args.

2. line 145: `expect(abortReason).toBe('lock-renewal-failed')` where
   `abortReason: string | null = null`. TS narrows the variable to
   `null` because the closure assignment in worker.register isn't
   observable to the inferrer. bun:test's `.toBe` overload then
   picks the null variant and rejects the string literal. Fixed by
   `as unknown as string` cast — the preceding `handlerAbortObserved`
   assertion guarantees we entered the branch where abortReason was
   assigned. Documented inline.

Local verify (29 checks) now green; E2E test still passes in 5.0s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 6ae9430 into master May 28, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant