fix: start gRPC health server before leader election #930
Merged
mattisonchao merged 1 commit into main on Mar 8, 2026
In sidecar mode, NewCoordinator() blocks on WaitToBecomeLeader().
Non-leader coordinators never started the gRPC server, so Kubernetes
liveness probes failed and killed the pods.
Move health server and gRPC server creation before NewCoordinator()
so that all coordinator pods respond to health checks immediately.
health.NewServer() automatically sets the default service ("") to
SERVING, so no explicit SetServingStatus calls are needed.
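The reordering can be sketched with only the standard library. This is a minimal illustration, not the code from this PR: it uses a plain HTTP `/healthz` endpoint as a stand-in for the gRPC health service, and `startHealthServer`, `newCoordinator`, and `probe` are hypothetical names. The point is only that the health endpoint is listening before the call that blocks on leader election.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// startHealthServer is a hypothetical stand-in for health.NewServer() plus
// StartGrpcServer(): once it returns, health checks are answered.
// Assumption: plain HTTP is used here instead of gRPC to stay stdlib-only.
func startHealthServer() (string, func(), error) {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(lis) // serves until stop() closes the server
	return lis.Addr().String(), func() { srv.Close() }, nil
}

// newCoordinator is a hypothetical stand-in for NewCoordinator(): it blocks
// until leadership is granted, as WaitToBecomeLeader() does.
func newCoordinator(leader <-chan struct{}) {
	<-leader
}

// probe returns the HTTP status code of the health endpoint.
func probe(addr string) (int, error) {
	resp, err := http.Get("http://" + addr + "/healthz")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// 1. Start the health endpoint first.
	addr, stop, err := startHealthServer()
	if err != nil {
		panic(err)
	}
	defer stop()

	// 2. Only then enter the blocking call. The channel is never closed,
	// so this models a pod that stays a non-leader. In a real binary the
	// call would block main; it runs in a goroutine here so main can probe.
	leader := make(chan struct{})
	go newCoordinator(leader)

	code, err := probe(addr)
	fmt.Println("health status:", code, "err:", err)
}
```

With the two steps swapped, `startHealthServer` would never run on a non-leader, which is the failure mode described above.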
Force-pushed from ab0fcf0 to 537f329.
Summary
Fix coordinator sidecar health checks in Kubernetes StatefulSet mode where non-leader coordinators get killed by liveness probes.
Problem
In sidecar mode (3-pod StatefulSet), only one coordinator wins the Lease election and becomes leader. Non-leader coordinators block at WaitToBecomeLeader() inside NewCoordinator(). The gRPC server was started after NewCoordinator(), so the health endpoint was never reachable on non-leader pods. Kubernetes liveness probes fail and kill them in a crash loop.
Fix
Move health.NewServer() and StartGrpcServer() before NewCoordinator(). This is a pure reorder; no new logic is added.
health.NewServer() in grpc-go automatically sets the default service ("") to SERVING, so no explicit SetServingStatus calls are needed. This matches the existing behavior of the standalone coordinator deployment, which has been running in production (109 days, 1 restart) without any manual health status calls.
Test plan