fix(minions): reconnect worker after promote connection loss #2025
Draft
maxpetrusenkoagent wants to merge 1 commit into
Recover the worker-owned Postgres pool when `promoteDelayed` escapes a retryable connection error, preventing the repeated `Promotion error: No database connection` loop from issue garrytan#1491.

Adds a regression test proving reconnect happens before the worker continues to claim work.
Verification note after opening draft PR:
Full local gate evidence is in the PR body.
Summary
- `promoteDelayed()` connection failures trigger `engine.reconnect()` instead of only logging `Promotion error: No database connection...` forever.

Closes #1491.
Why this is not a duplicate
I checked the closest open PRs before editing:
- `src/commands/autopilot.ts` health-check recovery so autopilot calls `engine.reconnect()` instead of `connect()` with no config.
- `PostgresEngine.reconnect()` itself with build-then-swap semantics.

This PR targets the remaining worker-loop site in `src/core/minions/worker.ts`: an escaped retryable error from `queue.promoteDelayed()` was logged and ignored, leaving the worker in the repeated `Promotion error: No database connection: connect() has not been called` loop described in #1491.

Test plan
- `bun test test/worker-promote-reconnect.test.ts` failed before the fix with `Expected: 1 reconnect call / Received: 0`.
- `bun test test/worker-promote-reconnect.test.ts`
- `bun test test/worker-promote-reconnect.test.ts test/worker-shutdown-disconnect.test.ts test/minions.test.ts test/retry-matcher.test.ts test/core/retry.test.ts`
- `bun test test/worker-promote-reconnect.test.ts test/queue-lock-retry.test.ts`
- `bun run typecheck`
- `bun run build`
- `bun run verify` `test/db-lock-heartbeat-takeover.test.ts` for direct `process.env.GBRAIN_LOCK_STEAL_GRACE_SECONDS` mutation

Autoreview
- `promoteDelayed()` is idempotent and claim still avoids inline retry.
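
For reviewers, the recovery path described in this PR can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/core/minions/worker.ts`: the `Engine` and `Queue` interfaces, `tick`, `isRetryableConnectionError`, and `demo` are hypothetical names invented for this sketch, and the real retry matcher and worker loop may look quite different. It only shows the intended ordering: a retryable promote failure triggers a reconnect before the worker goes on to claim.

```typescript
// Hypothetical minimal interfaces; the real engine and queue types may differ.
interface Engine {
  reconnect(): Promise<void>;
}

interface Queue {
  promoteDelayed(): Promise<number>;
  claim(): Promise<string | null>;
}

// Assumed retryable-error check, matching only on the message quoted in #1491.
function isRetryableConnectionError(err: unknown): boolean {
  return err instanceof Error && /No database connection/i.test(err.message);
}

// One iteration of a simplified worker loop: promote delayed jobs, recover the
// pool if promotion fails with a retryable connection error, then try to claim.
async function tick(
  engine: Engine,
  queue: Queue,
  log: (msg: string) => void,
): Promise<string | null> {
  try {
    await queue.promoteDelayed();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log(`Promotion error: ${message}`);
    if (isRetryableConnectionError(err)) {
      try {
        // Before the fix the error was only logged; reconnecting breaks the loop.
        await engine.reconnect();
      } catch (reconnectErr) {
        // Assumption for this sketch: skip claiming until a later tick succeeds.
        log(`Reconnect failed: ${String(reconnectErr)}`);
        return null;
      }
    }
  }
  return queue.claim();
}

// Drives two ticks against in-memory fakes and records the call order.
async function demo(): Promise<string[]> {
  const events: string[] = [];
  let connected = false;
  const engine: Engine = {
    async reconnect() {
      connected = true;
      events.push("reconnect");
    },
  };
  const queue: Queue = {
    async promoteDelayed() {
      if (!connected) {
        throw new Error("No database connection: connect() has not been called");
      }
      events.push("promote");
      return 0;
    },
    async claim() {
      events.push("claim");
      return null;
    },
  };
  await tick(engine, queue, () => {});
  await tick(engine, queue, () => {});
  return events;
}
```

In this sketch the first tick records `reconnect` then `claim`, and the second tick promotes normally. Because `promoteDelayed()` is described above as idempotent, the sketch does not retry it inline and simply lets the next tick run it again.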