fix: catch unhandled rejection in worker lock-renewal timer#1567
fix: catch unhandled rejection in worker lock-renewal timer#1567garrytan-agents wants to merge 3 commits into
Conversation
The judgeSignificance trimming (slice at 4000 chars) could split a UTF-16 surrogate pair when an emoji sits exactly at the boundary, producing a lone high surrogate that Anthropic's JSON parser rejects with 'no low surrogate in string'. Add safeSliceEnd() helper that backs up by one char when the cut lands between a high and low surrogate. Apply to: - judgeSignificance transcript trimming (the direct cause) - findBoundary hard-split fallback (defense-in-depth) Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by 🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md
The per-job lock renewal runs in a setInterval callback. When PgBouncer drops the DB connection (transient network blip, pool rotation, etc.), renewLock → executeRaw → getConnection throws. Since this runs in a detached async context inside setInterval, the rejection is unhandled and kills the entire worker process with exit code 1. This is the root cause of the 39 crashes/day pattern: workers run ~30 min (time for PgBouncer to rotate the connection), then crash on the next lock renewal tick. The supervisor respawns, but stableRunResetMs (5 min) resets the crash counter since each worker runs 30+ min — so it never triggers max_crashes protection. Fix: wrap the lock renewal in try/catch. On transient DB failure, log a warning and retry on the next interval tick. After 3 consecutive failures, treat as a lost lock and abort the job (stall detector will requeue it). The worker process stays alive.
|
Superseded by #1572: v0.41.26.1 lock-renewal cathedral wave. Thanks for the original fix — your try/catch shape made it into the new wave verbatim, with co-author attribution preserved on the commit. Codex outside-voice review caught 8 additional gaps that the inside-review missed (notably: the count-based threshold had a 15-second window where another worker could reclaim the row BEFORE this worker aborted, and the stored The |
…y (supersedes #1567) (#1572) * v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes #1567) Production worker daemons against Supabase / PgBouncer were crashing ~39 times/day with `unhandledRejection at renewLock`. PR #1567 proposed the right try/catch shape; this wave incorporates it and closes the entire bug class (4 inside-review + 8 outside-voice findings absorbed via 9 locked design decisions). What's fixed: - `setInterval(async () => await renewLock(...))` replaced with a sync wrapper around the new pure `runLockRenewalTick` function. No more unhandled rejections escaping the timer callback. - Second crash vector closed: `.catch()` on the stored `executeJob(...).finally(...)` promise so failJob/completeJob throws during the same outage can't propagate to `process.on('unhandledRejection')`. - Per-call `Promise.race` timeout (default `lockDuration/3`) bounds hung renewLock calls so the re-entrancy guard can't wedge indefinitely. - Time-based abort (NOT count-based) so the worker releases its lock BEFORE another worker can reclaim. With the prior 3-strike count + 30s lockDuration, a 15s window let other workers race. - Infrastructure aborts (`lock-renewal-failed`, `lock-lost`) don't burn job attempts — `executeJob`'s catch consults the exported `INFRASTRUCTURE_ABORT_REASONS` set and skips `failJob` so the stall detector reclaims cleanly. - Universal grace-eviction: 30s force-evict safety net now fires for ANY abort reason, not just `job.timeout_ms`. What's added: - `src/core/minions/lock-renewal-tick.ts` (NEW): pure extracted state-machine function + env-knob resolver. Three operator-tunable knobs via env (max-failures-for-audit, call-timeout-ms, safety-margin-ms) with stderr-warn-once on bad input + default fallback. - `src/core/audit/lock-renewal-audit.ts` (NEW): sibling of `batch-retry-audit.ts`. Four outcomes: failure / success_after_failure / gave_up / executeJob_rejected. JSONL at `~/.gbrain/audit/lock-renewal-YYYY-Www.jsonl`. - `src/core/audit/redact-connection-info.ts` (NEW): shared privacy helper. Strips Postgres URLs, host=, user=, password=, IPv4 from error messages before they hit audit JSONL. Wired into BOTH the new lock-renewal audit AND the existing batch-retry audit (privacy backfill — same risk class). - `scripts/check-worker-lock-renewal-shape.sh` (NEW): CI guard wired into `bun run verify`. Asserts the v0.41.22.1 bug pattern (`lockTimer = setInterval(async ...)`) stays absent AND the pure function call site survives refactors. Bug-pattern-specific so it doesn't fight legitimate refactors (codex C12). Tests: 64 new cases across 5 new test files. 182 existing minion + worker tests still pass. All hermetic — no PGLite, no real network, no `mock.module`. Plan + 9 decisions + codex outside-voice review at ~/.claude/plans/system-instruction-you-are-working-humming-nygaard.md Closes #1567 (incorporates the contributor's try/catch shape; closes the bug class structurally). Co-Authored-By: @garrytan-agents <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: fill A-H gaps for v0.41.26.1 lock-renewal cathedral The original v0.41.26.1 wave shipped 64 hermetic unit tests on the pure tick function, audit primitives, and redactor. Post-ship audit flagged 8 wiring gaps the pure tests can't see — A (launchJob wiring), B (executeJob skip-failJob), C (.catch on stored promise), D (INFRASTRUCTURE_ABORT_REASONS export), E (universal grace-evict), F (executeJob_rejected end-to-end), G (re-entrancy guard at worker layer), H (gold-standard E2E regression). Now closed: - **test/worker-lock-renewal-e2e.serial.test.ts** (1 test, gap H): the headline gold-standard regression. Real PGLite + real MinionWorker + executeRaw wrap that injects renewLock failures on demand. Pins that the worker process DOES NOT crash via unhandledRejection under sustained renewLock throws, the handler observes abort.signal.aborted = true with reason 'lock-renewal-failed', and the audit JSONL contains both `failure` and `gave_up` events. The exact v0.41.22.1 production bug class. Quarantined to its own file because bun:test serial + PGLite has an unresolved interaction with multiple MinionWorker-driven tests in the same file (second test's queue.add hangs indefinitely). - **test/worker-lock-renewal-shape.test.ts** (18 tests, gaps A-G): source-shape behavioral pins. Greps worker.ts function bodies for the patterns the locked decisions promised: launchJob calls runLockRenewalTick + resolveLockRenewalKnobs + uses lockRenewalAudit; tickInFlight declared and gated correctly; stored executeJob promise has .catch with logExecuteJobRejected + console stderr; abort.signal.addEventListener fires for any abort (not just timeout_ms); INFRASTRUCTURE_ABORT_REASONS used inside executeJob's catch with return-early shape. Bug-pattern-specific so a refactor that genuinely improves the shape passes; a refactor that accidentally strips a guarantee fails loud. All 83 lock-renewal wave tests pass in 5.2s. 205 existing minion + worker tests still green. No production code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: typecheck errors in worker-lock-renewal-e2e.serial.test.ts CI verify failed on the new E2E gap-fill test (commit a8b282d) due to two TS errors that bun's runtime accepts but tsc rejects: 1. line 92: `originalExecuteRaw(sql, ...args)` with `args: unknown[]` — TS can't prove args has <=2 elements, so the call site looks like 1+1+N args against a function that accepts 1-3. Fixed by destructuring the wrap params explicitly: `(sql, params?, opts?)` matching the executeRaw signature, then calling `originalExecuteRaw(sql, params, opts)` with named args. 2. line 145: `expect(abortReason).toBe('lock-renewal-failed')` where `abortReason: string | null = null`. TS narrows the variable to `null` because the closure assignment in worker.register isn't observable to the inferrer. bun:test's `.toBe` overload then picks the null variant and rejects the string literal. Fixed by `as unknown as string` cast — the preceding `handlerAbortObserved` assertion guarantees we entered the branch where abortReason was assigned. Documented inline. Local verify (29 checks) now green; E2E test still passes in 5.0s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: @garrytan-agents <noreply@github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
Root Cause
Caught the crash in the act:
The per-job lock renewal runs in a
setIntervalcallback (worker.ts:618). When PgBouncer drops/rotates the DB connection,renewLock → executeRaw → getConnectionthrows. Since this is a detached async context insidesetInterval, the rejection is unhandled and kills the worker process with exit code 1.This explains the pattern:
stableRunResetMs(5 min) resets crash counter because each worker runs 30+ min → never hitsmax_crashescode=1 (runtime_error)Fix
Wrap the lock renewal
awaitin try/catch:Impact
Eliminates the #1 source of supervisor crashes. Zero-risk: the fallback behavior (abort after 3 failures) is strictly better than the current behavior (process death on first failure).