Skip to content

fix(jobs work): switch worker engine to instance-owned pool (Issue #1413 worker-side)#1578

Closed
notjbg wants to merge 1 commit into
garrytan:masterfrom
notjbg:fix/jobs-work-instance-pool
Closed

fix(jobs work): switch worker engine to instance-owned pool (Issue #1413 worker-side)#1578
notjbg wants to merge 1 commit into
garrytan:masterfrom
notjbg:fix/jobs-work-instance-pool

Conversation

@notjbg

@notjbg notjbg commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Forces the queue-worker subprocess engine into instance-owned-pool mode at startup, mirroring the pattern @archerhpagent shipped in #1415 for the autopilot main loop, but applied at the separate worker code path (bun gbrain jobs work --max-rss N).

The bug

When a transient Supabase pool eviction nulls the module-level sql singleton (the catch path at src/core/db.ts:216 sets sql = null on connect failure), the next firing of src/core/minions/worker.ts:619's lockTimer setInterval calls queue.ts:renewLockengine.executeRawdb.ts:getConnection, which throws:

GBrainError: No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
    at getConnection (src/core/db.ts:153:15)
    at executeRaw (src/core/postgres-engine.ts:4349:18)
    at renewLock (src/core/minions/queue.ts:1024:36)
    at <anonymous> (src/core/minions/worker.ts:619:40)

That either:

  1. Unhandled-reject-crashes the worker — autopilot KeepAlive respawns. Observed 2x in 90s on a real production brain.
  2. Degrades the worker silentlyqueue.poll() returns nothing, the cycle stops making progress, but the daemon process stays alive and gbrain autopilot --status reports installed. Observed 110+ minutes of zero ingest progression on the same brain before manual restart.

Both states observed on v0.41.18.0 against a 42k-page Supabase pooler brain (Autopilot run, extract.links_fs phase). Logs at #1413 (comment) (my comment in the same incident thread).

Relationship to #1415 and #1413

@archerhpagent's #1415 fixed the same module-singleton-null bug class for the autopilot main loop by forcing engine.connect({...savedConfig, poolSize: 5}) at autopilot startup. That PR was closed 2026-05-26 on the claim that v0.41.11.x autopilot.ts rewrite eliminated the bug — which is true for the autopilot path, but the queue worker subprocess is a separate process with its own engine instance and its own connectEngine() call. Per-subprocess engine initialization defaults to the module-singleton fallback (db.connect(config) → module-level sql shared with the worker's executeRaw callers), so the worker inherits the same bug class.

This PR closes the bug at the worker startup site. Same pattern, different code path.

The fix

if (engine.kind === 'postgres') {
  const eng = engine as unknown as {
    _savedConfig?: EngineConfig & { poolSize?: number };
    _connectionStyle?: 'instance' | 'module' | null;
  };
  if (eng._savedConfig && eng._connectionStyle !== 'instance') {
    try {
      await engine.connect({ ...eng._savedConfig, poolSize: 5 } as EngineConfig & { poolSize: number });
    } catch (e) {
      console.error(`[jobs work] could not switch engine to instance pool: ${(e as Error).message}`);
    }
  }
}

Inserted right before new MinionWorker(engine, {...}) in case 'work': (around jobs.ts:814). 14 LOC of executable code + 17 LOC of explanatory comment.

Properties:

  • Gate on engine.kind === 'postgres' — PGLite engines unaffected (they ignore poolSize).
  • Read _savedConfig via narrow structural cast (mirrors PR fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth #1415 style; no as any blast radius).
  • Skip when _connectionStyle === 'instance' — idempotent on re-run, and a no-op when a future caller wires the engine with poolSize upstream of case 'work':.
  • try/catch wraps the connect() call so a transient pool-init blip logs and continues rather than killing the worker startup. The error message names the daemon process the operator's eyes are already on.
  • No new dependencies. No new public API.

Verification

Confirmed locally on the same production brain that hit the bug:

  • Before patch: Worker subprocess crashes within 90s of restart with the trace above. 2 unhandled-rejection crashes in the first run. Subsequent restarts also crash within minutes. Brain stays 100+ minutes behind real-time because each catch-up sync hits the bug mid-extract.links_fs.
  • After patch: Autopilot log shows [jobs work] switched engine to instance-owned pool (size=5) once at startup. Worker survives extract.links_fs (42k pages) cleanly. New ingest entries land on schedule. (Monitor still running at PR-submit time — will update this PR with the final clean-runtime number when it lands.)

Tests

The bug is hard to unit-test deterministically — reproducing it needs real Supabase + PgBouncer transaction-mode + connection pool eviction (or a brittle simulation of the same). I did NOT add a behavioral test for the new code path; the static guard _connectionStyle !== 'instance' && _savedConfig defined → call engine.connect with poolSize is exercised by the engine's existing connect() tests via the poolSize branch.

Happy to add a static-shape test (mirror of test/autopilot-supervisor-wiring.test.ts) that asserts the patch wires engine.connect({..., poolSize: 5}) between queue.ensureSchema() and new MinionWorker(), if that'd help unblock review.

Test plan

  • Static-shape test for the patch (optional; see Tests section above)
  • Typecheck clean (bun run typecheck on the branch — no errors introduced)
  • Manual repro confirmation on the v0.41.18.0 → patched scenario

🤖 Generated with Claude Code

Forces the worker subprocess engine into instance-owned-pool mode at
startup so `_sql` is non-null and the module-level singleton fallback
at `db.ts:getConnection` stops being load-bearing for every
`executeRaw` call site inside the worker.

The specific failure: when a transient Supabase pool eviction nulls
the module-level `sql` singleton (db.ts:216's connect() catch path),
the next `worker.ts:619` lock-keeper setInterval fires
`queue.ts:renewLock` -> `engine.executeRaw` -> `db.ts:getConnection`
which throws `No database connection: connect() has not been called`.
That either unhandled-reject-crashes the worker (autopilot KeepAlive
respawns, observed 2x in 90s) or degrades it silently (queue.poll
returns nothing, cycle stops making progress while the daemon stays
alive, observed 110+ minutes).

Confirmed reproducible on v0.41.18.0 against a 42k-page Supabase
pooler brain. After this patch the same brain runs cleanly under the
same workload conditions.

Companion to Issue garrytan#1413 / closed PR garrytan#1415, which fixed the autopilot
main loop the same way but did not touch this worker-subprocess code
path. The worker is a separate process (\`bun gbrain jobs work
--max-rss N\`) with its own engine and its own connection lifecycle;
PR garrytan#1415's autopilot.ts startup hook does not reach it.

Same pattern as PR garrytan#1415:
- Gate on \`engine.kind === 'postgres'\` (PGLite unaffected)
- Read \`_savedConfig\` via narrow structural cast (not \`as any\`)
- Skip when \`_connectionStyle === 'instance'\` (idempotent on re-run
  or when caller already wired engine with poolSize upstream)
- Wrap in try/catch so a transient pool-init blip logs and continues
  rather than killing the worker startup

No new dependencies. ~14 LOC. Idempotent.
@notjbg

notjbg commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Closing as obsolete: upstream v0.41.28.0's retry self-heal (reconnect() wired into batchRetry, #1570) supersedes this worker-side patch — verified in production that connection blips now recover at attempt 1/3 with zero row loss, so the instance-pool workaround is no longer needed.

@notjbg notjbg closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant