Skip to content

fix(minions): reconnect worker after promote connection loss#2025

Draft
maxpetrusenkoagent wants to merge 1 commit into
garrytan:masterfrom
maxpetrusenkoagent:hermes/oss-pr-2026-06-09-gbrain-1491
Draft

fix(minions): reconnect worker after promote connection loss#2025
maxpetrusenkoagent wants to merge 1 commit into
garrytan:masterfrom
maxpetrusenkoagent:hermes/oss-pr-2026-06-09-gbrain-1491

Conversation

@maxpetrusenkoagent

Copy link
Copy Markdown
Contributor

Summary

  • Fixes the minion worker poll loop so retryable promoteDelayed() connection failures trigger engine.reconnect() instead of only logging Promotion error: No database connection... forever.
  • Extracts the existing claim reconnect path into a shared worker helper, preserving claim's no-inline-retry safety rule.
  • Adds a regression test proving reconnect happens before the worker continues to claim work after a promotion-loop connection loss.

Closes #1491.

Why this is not a duplicate

I checked the closest open PRs before editing:

This PR targets the remaining worker-loop site in src/core/minions/worker.ts: an escaped retryable error from queue.promoteDelayed() was logged and ignored, leaving the worker in the repeated Promotion error: No database connection: connect() has not been called loop described in #1491.

Test plan

  • RED: bun test test/worker-promote-reconnect.test.ts failed before the fix with Expected: 1 reconnect call / Received: 0.
  • GREEN: bun test test/worker-promote-reconnect.test.ts
    • 1 pass, 0 fail
  • Focused regression/adjacent suite:
    • bun test test/worker-promote-reconnect.test.ts test/worker-shutdown-disconnect.test.ts test/minions.test.ts test/retry-matcher.test.ts test/core/retry.test.ts
    • 239 pass, 0 fail
  • Post-review focused suite:
    • bun test test/worker-promote-reconnect.test.ts test/queue-lock-retry.test.ts
    • 3 pass, 0 fail
  • bun run typecheck
    • pass
  • bun run build
    • pass
  • bun run verify
    • 29/30 checks passed
    • blocked by existing untouched test isolation lint in test/db-lock-heartbeat-takeover.test.ts for direct process.env.GBRAIN_LOCK_STEAL_GRACE_SECONDS mutation

Autoreview

  • Hermes subagent review: clean enough for draft PR. Non-blocking suggestion to assert reconnect-before-claim ordering was applied.
  • Codex review against exact diff: clean. It confirmed reconnect-on-promote is safe because promoteDelayed() is idempotent and claim still avoids inline retry.

Recover the worker-owned Postgres pool when promoteDelayed escapes a retryable connection error, preventing the repeated Promotion error: No database connection loop from issue garrytan#1491.\n\nAdds a regression test proving reconnect happens before the worker continues to claim work.
@maxpetrusenkoagent

Copy link
Copy Markdown
Contributor Author

Verification note after opening draft PR:

  • GitHub check suite: gh pr checks 2025 --repo garrytan/gbrain reports no checks on this fork branch.
  • Fallback review: ran explicit Hermes subagent review and Codex read-only review against the exact PR diff. Both returned no blocking findings and clean/enough-for-draft verdicts.
  • Applied the only non-blocking review suggestion: strengthened the regression test to assert reconnect happens before the worker continues to claim.
  • Local verification after that change:
    • bun test test/worker-promote-reconnect.test.ts test/queue-lock-retry.test.ts → 3 pass, 0 fail
    • bun run typecheck → pass

Full local gate evidence is in the PR body. bun run verify currently fails only on a pre-existing untouched isolation-lint violation in test/db-lock-heartbeat-takeover.test.ts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

autopilot: minion worker crashes with "No database connection: connect() has not been called"

1 participant