fix: only call GetInfo RPC on state transition, not every health check #933

mattisonchao merged 11 commits into main
Conversation
becomeAvailable() was calling syncDataServerInfo() on every successful health check (~every 2s), even when the server was already Running. This caused continuous redundant GetInfo RPCs and log noise. Now GetInfo is only called when transitioning from NotRunning to Running. Closes #932 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds TestDataServerController_GetInfoOnlyCalledOnStateTransition which verifies GetInfo is not called repeatedly while the server is already Running, and is only triggered on NotRunning -> Running transitions. Also adds initial syncDataServerInfo() call in the constructor since the controller starts in Running state and the fix skips GetInfo for already-Running servers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move syncDataServerInfo inside the NotRunning transition path directly and early-return when status is already Running, removing the wasNotRunning flag variable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the separate initial-info-sync goroutine. Instead, start the controller with NotRunning status so the first successful health check naturally triggers the NotRunning -> Running transition and calls syncDataServerInfo(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
becomeAvailable() was resetting healthCheckBackoff and healthWatchBackoff which are owned by other goroutines (healthPingWithRetries and healthWatchWithRetries respectively). This caused a data race when both goroutines triggered becomeAvailable() concurrently on the first health check. The resets are redundant since backoff.RetryNotify already calls Reset() internally at the start of each retry cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The healthCheckBackoff and healthWatchBackoff fields were shared across goroutines: becomeAvailable() reset them while healthPingWithRetries and healthWatchWithRetries used them concurrently via backoff.RetryNotify. Fix by creating backoff objects locally inside each retry function so each goroutine owns its own backoff instance. Also fix test timing: wait for GetInfo count to increase rather than just status change, since syncDataServerInfo runs after status is set to Running. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ConcurrentBackOff, a mutex-wrapped backoff.BackOff, to make it safe for concurrent use. Both healthPingWithRetries and healthWatchWithRetries now share a single ConcurrentBackOff instance, allowing becomeAvailable() to safely reset it from either goroutine without data races. This replaces the previous approach of creating separate local backoff objects per goroutine, which lost the ability to reset both on recovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mattisonchao
left a comment
Self-review — ready for review.
Summary of changes:

- Fix redundant GetInfo RPCs: `becomeAvailable()` now early-returns when status is already `Running`, so `syncDataServerInfo()` only runs on `NotRunning` → `Running` transitions. This eliminates the ~2s polling of GetInfo on every health check.
- Start with `NotRunning`: Controller initial status changed from `Running` to `NotRunning`, so the first health check naturally triggers the transition and initial `GetInfo` call — no extra init goroutine needed.
- Fix data race on backoff: Introduced `ConcurrentBackOff` (mutex-wrapped `backoff.BackOff`) shared by both `healthPingWithRetries` and `healthWatchWithRetries`. This fixes a latent race where `becomeAvailable()` reset backoff objects concurrently with `backoff.RetryNotify` — previously hidden because the controller started as `Running` and never entered the reset path on first health check.
- Test: `TestDataServerController_GetInfoOnlyCalledOnStateTransition` verifies GetInfo count stays stable while `Running` and only increases on state transitions.

All tests pass with `-race`.
The 5s timeout was too tight for CI runners — the test needs health check (~2s), status transition, assignment dispatch, and gRPC response to all complete before the leader hint is available on a non-leader node. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With the NotRunning initial status, the coordinator needs health checks to pass before nodes are considered available. Under CI load with race detector, the existing 10s timeouts are too tight. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The health ping goroutine previously waited 2s (ticker interval) before the first check. With NotRunning initial status, this added unnecessary startup delay. Now does an immediate check before entering the ticker loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Motivation
The coordinator's `dataServerController` sends a `GetInfo` RPC to every data server on every successful health check (~every 2s), flooding logs with `Received GetInfo request` and generating unnecessary network traffic. `GetInfo` only needs to be called when a server transitions from `NotRunning` to `Running`.

Modifications
- `becomeAvailable()` → early return: When status is already `Running`, skip `syncDataServerInfo()`. Only call `GetInfo` on actual `NotRunning` → `Running` transitions.
- Start with `NotRunning`: The first health check naturally triggers the transition and initial `GetInfo` call — no extra init goroutine needed.
- `ConcurrentBackOff`: Mutex-wrapped `backoff.BackOff` shared by both `healthPingWithRetries` and `healthWatchWithRetries`. Fixes a latent data race where `becomeAvailable()` reset backoff objects concurrently with `backoff.RetryNotify`.
- `TestLeaderHint*` timeouts increased from 10s to 20s for CI stability.

Closes #932
Test plan
- All tests pass with `-race`
- `TestDataServerController_GetInfoOnlyCalledOnStateTransition` verifies `GetInfo` only fires on `NotRunning` → `Running` transitions and stays stable while already `Running`