fix: start gRPC health server before leader election #930

Merged
mattisonchao merged 1 commit into main from fix/coordinator-sidecar-health-check
Mar 8, 2026

@mattisonchao mattisonchao commented Mar 8, 2026

Summary

Fix coordinator sidecar health checks in Kubernetes StatefulSet mode where non-leader coordinators get killed by liveness probes.

Problem

In sidecar mode (3-pod StatefulSet), only one coordinator wins the Lease election and becomes leader. Non-leader coordinators block at WaitToBecomeLeader() inside NewCoordinator(). The gRPC server was started only after NewCoordinator() returned, so the health endpoint was never reachable on non-leader pods. Kubernetes liveness probes failed on those pods and restarted them in a crash loop.

Fix

Move health.NewServer() and StartGrpcServer() before NewCoordinator(). This is a pure reorder; no new logic is added.

health.NewServer() in grpc-go automatically sets the default service ("") to SERVING, so no explicit SetServingStatus calls are needed. This matches the existing behavior of the standalone coordinator deployment, which has been running in production (109 days, 1 restart) without any manual health status calls.
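
For reviewers who want to see the ordering effect in isolation, here is a minimal stdlib-only sketch. It is not the coordinator code: a plain HTTP listener stands in for the gRPC health service, and `waitToBecomeLeader`, `startHealthServer`, `runCoordinator`, and `followerHealthy` are hypothetical names invented for this illustration. It models a pod that never wins the election and checks whether a probe can reach it under each startup order.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// waitToBecomeLeader is a hypothetical stand-in for the blocking election
// wait: it returns only once the leader channel is closed.
func waitToBecomeLeader(leader <-chan struct{}) {
	<-leader
}

// startHealthServer binds an ephemeral local port and answers every request
// with 200 OK. It is an HTTP stand-in for the gRPC health service.
func startHealthServer() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	handler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "SERVING")
	})
	go http.Serve(ln, handler) // serves until the process exits
	return ln.Addr().String(), nil
}

// probe mimics a liveness probe: a single GET with a short timeout.
func probe(addr string) bool {
	client := &http.Client{Timeout: 500 * time.Millisecond}
	resp, err := client.Get("http://" + addr + "/")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// runCoordinator models startup. With healthFirst set, the health listener
// comes up before the blocking wait (the order after this change); otherwise
// it comes up only after leadership is won (the order before this change).
func runCoordinator(healthFirst bool, leader <-chan struct{}, ready chan<- string) {
	if !healthFirst {
		waitToBecomeLeader(leader)
	}
	addr, err := startHealthServer()
	if err != nil {
		return
	}
	ready <- addr
	if healthFirst {
		waitToBecomeLeader(leader)
	}
}

// followerHealthy starts a coordinator that never becomes leader and reports
// whether its health endpoint answers a probe within a short deadline.
func followerHealthy(healthFirst bool) bool {
	leader := make(chan struct{}) // never closed: this pod stays a follower
	ready := make(chan string, 1)
	go runCoordinator(healthFirst, leader, ready)
	select {
	case addr := <-ready:
		return probe(addr)
	case <-time.After(200 * time.Millisecond):
		return false // nothing is listening, so a probe would fail
	}
}

func main() {
	fmt.Println("old order, follower healthy:", followerHealthy(false)) // false
	fmt.Println("new order, follower healthy:", followerHealthy(true))  // true
}
```

In the sketch, the only difference between the two runs is where the blocking wait sits relative to the listener startup, which mirrors the reorder in this PR.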

Test plan

  • Deploy 3-pod coordinator StatefulSet on kind cluster — all pods 2/2 Running, 0 restarts
  • Chaos test: 5 random pod kills + leader kill — all pods recovered correctly
  • CI passes

In sidecar mode, NewCoordinator() blocks on WaitToBecomeLeader().
Non-leader coordinators never started the gRPC server, so Kubernetes
liveness probes failed and killed the pods.

Move health server and gRPC server creation before NewCoordinator()
so that all coordinator pods respond to health checks immediately.
health.NewServer() automatically sets the default service ("") to
SERVING, so no explicit SetServingStatus calls are needed.
@mattisonchao mattisonchao force-pushed the fix/coordinator-sidecar-health-check branch from ab0fcf0 to 537f329 March 8, 2026 14:19
@mattisonchao mattisonchao self-assigned this Mar 8, 2026
@mattisonchao mattisonchao changed the title from "fix: set coordinator health status to SERVING before leader election" to "fix: start gRPC health server before leader election" Mar 8, 2026
@mattisonchao mattisonchao merged commit 8533715 into main Mar 8, 2026
9 checks passed
@mattisonchao mattisonchao deleted the fix/coordinator-sidecar-health-check branch March 8, 2026 14:29