fix: start gRPC health server before leader election by mattisonchao · Pull Request #930 · oxia-db/oxia

mattisonchao · 2026-03-08T09:29:38Z

Summary

Fix coordinator sidecar health checks in Kubernetes StatefulSet mode where non-leader coordinators get killed by liveness probes.

Problem

In sidecar mode (3-pod StatefulSet), only one coordinator wins the Lease election and becomes leader. The others block at WaitToBecomeLeader() inside NewCoordinator(). Because the gRPC server was started after NewCoordinator(), the health endpoint was never reachable on non-leader pods, so Kubernetes liveness probes failed and restarted those pods in a crash loop.

Fix

Move health.NewServer() and StartGrpcServer() before NewCoordinator(). This is a pure reorder — no new logic added.
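
The effect of the reorder can be illustrated with a minimal, standard-library-only sketch. It is not the oxia code: the `pod` type, `run`, `startHealth`, and `probeNonLeader` are hypothetical names, a plain TCP listener stands in for the gRPC health server, and a channel that is never closed stands in for an election the pod never wins.

```go
package main

import (
	"context"
	"net"
	"time"
)

// pod is a hypothetical stand-in for one coordinator process.
type pod struct {
	addr    chan string   // receives the health address once the listener is up
	entered chan struct{} // closed right before the pod starts waiting on the election
}

// startHealth opens a loopback listener (a stand-in for the gRPC health
// server) and publishes its address. The listener closes when ctx ends.
func (p *pod) startHealth(ctx context.Context) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return
	}
	go func() {
		<-ctx.Done()
		ln.Close()
	}()
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			conn.Close()
		}
	}()
	p.addr <- ln.Addr().String()
}

// run mimics the startup sequence. A nil elected channel never becomes
// ready, which models a non-leader blocked in the election wait.
func (p *pod) run(ctx context.Context, healthFirst bool, elected <-chan struct{}) {
	if healthFirst {
		p.startHealth(ctx) // fixed order: health endpoint first
	}
	close(p.entered)
	select {
	case <-ctx.Done():
		return
	case <-elected:
	}
	if !healthFirst {
		p.startHealth(ctx) // old order: only reached after winning
	}
}

// probeNonLeader reports whether a pod that never wins the election
// accepts a connection on its health port.
func probeNonLeader(healthFirst bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	p := &pod{addr: make(chan string, 1), entered: make(chan struct{})}
	go p.run(ctx, healthFirst, nil)
	<-p.entered
	select {
	case addr := <-p.addr:
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	default:
		return false // no health endpoint exists yet
	}
}
```

Calling `probeNonLeader(true)` and `probeNonLeader(false)` from a `main` function contrasts the two orderings: only the first gives a non-leader a reachable endpoint while it is still waiting.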

health.NewServer() in grpc-go automatically sets the default service ("") to SERVING, so no explicit SetServingStatus calls are needed. This matches the existing behavior of the standalone coordinator deployment, which has been running in production (109 days, 1 restart) without any manual health status calls.

Test plan

Deploy 3-pod coordinator StatefulSet on kind cluster — all pods 2/2 Running, 0 restarts
Chaos test: 5 random pod kills + leader kill — all pods recovered correctly
CI passes

In sidecar mode, NewCoordinator() blocks on WaitToBecomeLeader(). Non-leader coordinators never started the gRPC server, so Kubernetes liveness probes failed and killed the pods. Move health server and gRPC server creation before NewCoordinator() so that all coordinator pods respond to health checks immediately. health.NewServer() automatically sets the default service ("") to SERVING, so no explicit SetServingStatus calls are needed.

mattisonchao requested review from RobertIndie, coderzc and merlimat as code owners March 8, 2026 09:29

mattisonchao force-pushed the fix/coordinator-sidecar-health-check branch from ab0fcf0 to 537f329 Compare March 8, 2026 14:19

mattisonchao self-assigned this Mar 8, 2026

mattisonchao changed the title ~~fix: set coordinator health status to SERVING before leader election~~ fix: start gRPC health server before leader election Mar 8, 2026

mattisonchao merged commit 8533715 into main Mar 8, 2026
9 checks passed

mattisonchao deleted the fix/coordinator-sidecar-health-check branch March 8, 2026 14:29

mattisonchao merged 1 commit into main from fix/coordinator-sidecar-health-check
