Skip to content

fix: only call GetInfo RPC on state transition, not every health check#933

Merged
mattisonchao merged 11 commits intomainfrom
fix/redundant-getinfo-on-health-check
Mar 8, 2026
Merged

fix: only call GetInfo RPC on state transition, not every health check#933
mattisonchao merged 11 commits intomainfrom
fix/redundant-getinfo-on-health-check

Conversation

@mattisonchao
Copy link
Copy Markdown
Member

@mattisonchao mattisonchao commented Mar 8, 2026

Motivation

The coordinator's dataServerController sends a GetInfo RPC to every data server on every successful health check (~every 2s), flooding logs with Received GetInfo request and generating unnecessary network traffic. GetInfo only needs to be called when a server transitions from NotRunning to Running.

Modifications

  • becomeAvailable() → early return: When status is already Running, skip syncDataServerInfo(). Only call GetInfo on actual NotRunningRunning transitions.
  • Start controller as NotRunning: The first health check naturally triggers the transition and initial GetInfo call — no extra init goroutine needed.
  • ConcurrentBackOff: Mutex-wrapped backoff.BackOff shared by both healthPingWithRetries and healthWatchWithRetries. Fixes a latent data race where becomeAvailable() reset backoff objects concurrently with backoff.RetryNotify.
  • Immediate health ping: Do an immediate health check on startup instead of waiting for the 2s ticker, eliminating unnecessary startup delay.
  • Increased test timeouts: TestLeaderHint* timeouts increased from 10s to 20s for CI stability.

Closes #932

Test plan

  • All tests pass with -race
  • New TestDataServerController_GetInfoOnlyCalledOnStateTransition verifies GetInfo only fires on NotRunningRunning transitions and stays stable while already Running

becomeAvailable() was calling syncDataServerInfo() on every successful
health check (~every 2s), even when the server was already Running.
This caused continuous redundant GetInfo RPCs and log noise.

Now GetInfo is only called when transitioning from NotRunning to Running.

Closes #932

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mattisonchao and others added 4 commits March 8, 2026 22:39
Adds TestDataServerController_GetInfoOnlyCalledOnStateTransition which
verifies GetInfo is not called repeatedly while the server is already
Running, and is only triggered on NotRunning -> Running transitions.

Also adds initial syncDataServerInfo() call in the constructor since
the controller starts in Running state and the fix skips GetInfo for
already-Running servers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move syncDataServerInfo inside the NotRunning transition path directly
and early-return when status is already Running, removing the
wasNotRunning flag variable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…utine

Remove the separate initial-info-sync goroutine. Instead, start the
controller with NotRunning status so the first successful health check
naturally triggers the NotRunning -> Running transition and calls
syncDataServerInfo().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao self-assigned this Mar 8, 2026
mattisonchao and others added 3 commits March 8, 2026 23:17
becomeAvailable() was resetting healthCheckBackoff and healthWatchBackoff
which are owned by other goroutines (healthPingWithRetries and
healthWatchWithRetries respectively). This caused a data race when both
goroutines triggered becomeAvailable() concurrently on the first health
check. The resets are redundant since backoff.RetryNotify already calls
Reset() internally at the start of each retry cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The healthCheckBackoff and healthWatchBackoff fields were shared across
goroutines: becomeAvailable() reset them while healthPingWithRetries and
healthWatchWithRetries used them concurrently via backoff.RetryNotify.

Fix by creating backoff objects locally inside each retry function so
each goroutine owns its own backoff instance. Also fix test timing:
wait for GetInfo count to increase rather than just status change,
since syncDataServerInfo runs after status is set to Running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ConcurrentBackOff, a mutex-wrapped backoff.BackOff, to make it safe
for concurrent use. Both healthPingWithRetries and healthWatchWithRetries
now share a single ConcurrentBackOff instance, allowing becomeAvailable()
to safely reset it from either goroutine without data races.

This replaces the previous approach of creating separate local backoff
objects per goroutine, which lost the ability to reset both on recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member Author

@mattisonchao mattisonchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review — ready for review.

Summary of changes:

  1. Fix redundant GetInfo RPCs: becomeAvailable() now early-returns when status is already Running, so syncDataServerInfo() only runs on NotRunningRunning transitions. This eliminates the ~2s polling of GetInfo on every health check.

  2. Start with NotRunning: Controller initial status changed from Running to NotRunning, so the first health check naturally triggers the transition and initial GetInfo call — no extra init goroutine needed.

  3. Fix data race on backoff: Introduced ConcurrentBackOff (mutex-wrapped backoff.BackOff) shared by both healthPingWithRetries and healthWatchWithRetries. This fixes a latent race where becomeAvailable() reset backoff objects concurrently with backoff.RetryNotify — previously hidden because the controller started as Running and never entered the reset path on first health check.

  4. Test: TestDataServerController_GetInfoOnlyCalledOnStateTransition verifies GetInfo count stays stable while Running and only increases on state transitions.

All tests pass with -race.

mattisonchao and others added 3 commits March 9, 2026 00:06
The 5s timeout was too tight for CI runners — the test needs health
check (~2s), status transition, assignment dispatch, and gRPC response
to all complete before the leader hint is available on a non-leader node.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With the NotRunning initial status, the coordinator needs health checks
to pass before nodes are considered available. Under CI load with race
detector, the existing 10s timeouts are too tight.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The health ping goroutine previously waited 2s (ticker interval) before
the first check. With NotRunning initial status, this added unnecessary
startup delay. Now does an immediate check before entering the ticker
loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao merged commit 4db3949 into main Mar 8, 2026
9 checks passed
@mattisonchao mattisonchao deleted the fix/redundant-getinfo-on-health-check branch March 8, 2026 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Redundant GetInfo RPC called on every health check instead of only on state transitions

2 participants