v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes #1567)#1572
Conversation
…y (supersedes #1567) Production worker daemons against Supabase / PgBouncer were crashing ~39 times/day with `unhandledRejection at renewLock`. PR #1567 proposed the right try/catch shape; this wave incorporates it and closes the entire bug class (4 inside-review + 8 outside-voice findings absorbed via 9 locked design decisions). What's fixed: - `setInterval(async () => await renewLock(...))` replaced with a sync wrapper around the new pure `runLockRenewalTick` function. No more unhandled rejections escaping the timer callback. - Second crash vector closed: `.catch()` on the stored `executeJob(...).finally(...)` promise so failJob/completeJob throws during the same outage can't propagate to `process.on('unhandledRejection')`. - Per-call `Promise.race` timeout (default `lockDuration/3`) bounds hung renewLock calls so the re-entrancy guard can't wedge indefinitely. - Time-based abort (NOT count-based) so the worker releases its lock BEFORE another worker can reclaim. With the prior 3-strike count + 30s lockDuration, a 15s window let other workers race. - Infrastructure aborts (`lock-renewal-failed`, `lock-lost`) don't burn job attempts — `executeJob`'s catch consults the exported `INFRASTRUCTURE_ABORT_REASONS` set and skips `failJob` so the stall detector reclaims cleanly. - Universal grace-eviction: 30s force-evict safety net now fires for ANY abort reason, not just `job.timeout_ms`. What's added: - `src/core/minions/lock-renewal-tick.ts` (NEW): pure extracted state-machine function + env-knob resolver. Three operator-tunable knobs via env (max-failures-for-audit, call-timeout-ms, safety-margin-ms) with stderr-warn-once on bad input + default fallback. - `src/core/audit/lock-renewal-audit.ts` (NEW): sibling of `batch-retry-audit.ts`. Four outcomes: failure / success_after_failure / gave_up / executeJob_rejected. JSONL at `~/.gbrain/audit/lock-renewal-YYYY-Www.jsonl`. - `src/core/audit/redact-connection-info.ts` (NEW): shared privacy helper. Strips Postgres URLs, host=, user=, password=, IPv4 from error messages before they hit audit JSONL. Wired into BOTH the new lock-renewal audit AND the existing batch-retry audit (privacy backfill — same risk class). - `scripts/check-worker-lock-renewal-shape.sh` (NEW): CI guard wired into `bun run verify`. Asserts the v0.41.22.1 bug pattern (`lockTimer = setInterval(async ...)`) stays absent AND the pure function call site survives refactors. Bug-pattern-specific so it doesn't fight legitimate refactors (codex C12). Tests: 64 new cases across 5 new test files. 182 existing minion + worker tests still pass. All hermetic — no PGLite, no real network, no `mock.module`. Plan + 9 decisions + codex outside-voice review at ~/.claude/plans/system-instruction-you-are-working-humming-nygaard.md Closes #1567 (incorporates the contributor's try/catch shape; closes the bug class structurally). Co-Authored-By: @garrytan-agents <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original v0.41.26.1 wave shipped 64 hermetic unit tests on the pure tick function, audit primitives, and redactor. Post-ship audit flagged 8 wiring gaps the pure tests can't see — A (launchJob wiring), B (executeJob skip-failJob), C (.catch on stored promise), D (INFRASTRUCTURE_ABORT_REASONS export), E (universal grace-evict), F (executeJob_rejected end-to-end), G (re-entrancy guard at worker layer), H (gold-standard E2E regression). Now closed: - **test/worker-lock-renewal-e2e.serial.test.ts** (1 test, gap H): the headline gold-standard regression. Real PGLite + real MinionWorker + executeRaw wrap that injects renewLock failures on demand. Pins that the worker process DOES NOT crash via unhandledRejection under sustained renewLock throws, the handler observes abort.signal.aborted = true with reason 'lock-renewal-failed', and the audit JSONL contains both `failure` and `gave_up` events. The exact v0.41.22.1 production bug class. Quarantined to its own file because bun:test serial + PGLite has an unresolved interaction with multiple MinionWorker-driven tests in the same file (second test's queue.add hangs indefinitely). - **test/worker-lock-renewal-shape.test.ts** (18 tests, gaps A-G): source-shape behavioral pins. Greps worker.ts function bodies for the patterns the locked decisions promised: launchJob calls runLockRenewalTick + resolveLockRenewalKnobs + uses lockRenewalAudit; tickInFlight declared and gated correctly; stored executeJob promise has .catch with logExecuteJobRejected + console stderr; abort.signal.addEventListener fires for any abort (not just timeout_ms); INFRASTRUCTURE_ABORT_REASONS used inside executeJob's catch with return-early shape. Bug-pattern-specific so a refactor that genuinely improves the shape passes; a refactor that accidentally strips a guarantee fails loud. All 83 lock-renewal wave tests pass in 5.2s. 205 existing minion + worker tests still green. No production code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Test coverage gap-fill landed in a8b282d (closes the post-ship gap audit covering A-H). Two new test files, 19 new cases, 5.2s wallclock:
Why E2E sits in its own single-test file: bun:test's serial runner has an unresolved interaction with PGLite when multiple MinionWorker-driven tests share a file (the second test's Total wave coverage: 83 cases across 8 files, all hermetic, all pass in 5.2s. 205 existing minion + worker tests still green. |
CI verify failed on the new E2E gap-fill test (commit a8b282d) due to two TS errors that bun's runtime accepts but tsc rejects: 1. line 92: `originalExecuteRaw(sql, ...args)` with `args: unknown[]` — TS can't prove args has <=2 elements, so the call site looks like 1+1+N args against a function that accepts 1-3. Fixed by destructuring the wrap params explicitly: `(sql, params?, opts?)` matching the executeRaw signature, then calling `originalExecuteRaw(sql, params, opts)` with named args. 2. line 145: `expect(abortReason).toBe('lock-renewal-failed')` where `abortReason: string | null = null`. TS narrows the variable to `null` because the closure assignment in worker.register isn't observable to the inferrer. bun:test's `.toBe` overload then picks the null variant and rejects the string literal. Fixed by `as unknown as string` cast — the preceding `handlerAbortObserved` assertion guarantees we entered the branch where abortReason was assigned. Documented inline. Local verify (29 checks) now green; E2E test still passes in 5.0s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
Summary
unhandledRejection at renewLock.executeJob(...).finally(...)promise had no.catch(), so failJob/completeJob throws during the same DB outage would crash the daemon too.redactConnectionInfohelper strips Postgres URLs/host/user/password/IPv4 from error messages before they hit disk. Wired into both new lock-renewal-audit AND existing batch-retry-audit (same risk class).Why this supersedes #1567
PR #1567 (contributor
@garrytan-agents) proposed the right try/catch shape but was missing:job.timeout_ms, not lock-renewal aborts).safeSliceEndUTF-16 surrogate fix — different bug class, filed as TODO-LR-1).The plan-eng-review surfaced 4 of these gaps inside, and codex outside-voice surfaced the other 4. All 8 absorbed via 9 locked design decisions (D1-D9) captured in the plan file.
Architecture
Pure-function extraction is the load-bearing change:
src/core/minions/lock-renewal-tick.ts—runLockRenewalTick(deps, state)returns a tagged union ('ok' | 'cancelled' | 'lock_lost' | 'should_abort'). All side effects (clearInterval, abort.abort, finally cleanup) stay in the worker's closure scope. Tests drive the state machine with fake deps — nosetIntervalplumbing, no PGLite, nomock.module.worker.ts:launchJobreduces to a thinsetInterval(() => void runLockRenewalTick(...))sync wrapper.scripts/check-worker-lock-renewal-shape.shCI guard wires intobun run verify. Asserts the bug pattern (lockTimer = setInterval(async ...)) stays absent AND the call site survives refactors. Narrowed vialockTimer =prefix so unrelatedsetInterval(async)calls elsewhere in the file (e.g. stall detector — codex C13, filed separately as TODO-LR-4) don't false-fire.Test plan
runLockRenewalTick(state machine paths, cancellation, time-based abort, hung-renewLock timeout, audit defense-in-depth, env-knob resolution)batch-retry-audit(snapshot JSONL — no DSN/host/password)bun run verifygreen (29 checks including the new shape guard)Total: 64 new cases + 182 existing tests verified. All hermetic.
Test failures in the parallel suite (longmemeval timeouts, schema-pack timeouts, bootstrap PGLite cold-start timeouts, query-cache flakes, eval-prune timeouts) are pre-existing infrastructure-flake under heavy load — none touch any code in this PR.
Operator verification after upgrade
Closes
🤖 Generated with Claude Code