fix: only call GetInfo RPC on state transition, not every health check by mattisonchao · Pull Request #933 · oxia-db/oxia

mattisonchao · 2026-03-08T14:35:10Z

Motivation

The coordinator's dataServerController sends a GetInfo RPC to every data server on every successful health check (~every 2s), flooding logs with "Received GetInfo request" messages and generating unnecessary network traffic. GetInfo only needs to be called when a server transitions from NotRunning to Running.

Modifications

becomeAvailable() → early return: When status is already Running, skip syncDataServerInfo(). Only call GetInfo on actual NotRunning → Running transitions.
Start controller as NotRunning: The first health check naturally triggers the transition and initial GetInfo call — no extra init goroutine needed.
ConcurrentBackOff: Mutex-wrapped backoff.BackOff shared by both healthPingWithRetries and healthWatchWithRetries. Fixes a latent data race where becomeAvailable() reset backoff objects concurrently with backoff.RetryNotify.
Immediate health ping: Do an immediate health check on startup instead of waiting for the 2s ticker, eliminating unnecessary startup delay.
Increased test timeouts: TestLeaderHint* timeouts increased from 10s to 20s for CI stability.
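
For illustration only, the early-return behavior described above can be modeled with a tiny state machine. This is a minimal sketch, not the actual controller code: the `controller` type, its `getInfoCalls` counter, `becomeUnavailable()`, and `calls()` are hypothetical stand-ins, and the simulated GetInfo call runs under the lock purely to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a simplified stand-in for the controller's server status.
type Status int

const (
	NotRunning Status = iota
	Running
)

// controller is a hypothetical, trimmed-down model of dataServerController.
type controller struct {
	mu           sync.Mutex
	status       Status
	getInfoCalls int // counts simulated GetInfo RPCs
}

// newController starts in NotRunning so that the first successful health
// check performs the transition and the initial info sync.
func newController() *controller {
	return &controller{status: NotRunning}
}

// becomeAvailable is invoked after every successful health check.
func (c *controller) becomeAvailable() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.status == Running {
		return // already Running: skip the redundant GetInfo
	}
	c.status = Running
	c.getInfoCalls++ // stands in for syncDataServerInfo()
}

// becomeUnavailable models a failed health check (assumed name).
func (c *controller) becomeUnavailable() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.status = NotRunning
}

// calls returns how many simulated GetInfo RPCs were issued.
func (c *controller) calls() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.getInfoCalls
}

func main() {
	c := newController()
	for i := 0; i < 5; i++ {
		c.becomeAvailable() // five healthy checks in a row
	}
	fmt.Println("GetInfo calls after 5 healthy checks:", c.calls())

	c.becomeUnavailable()
	c.becomeAvailable() // recovery triggers one more GetInfo
	fmt.Println("GetInfo calls after recovery:", c.calls())
}
```

In this model, repeated healthy checks leave the counter unchanged, and only a NotRunning → Running transition increments it.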

Closes #932

Test plan

All tests pass with -race
New TestDataServerController_GetInfoOnlyCalledOnStateTransition verifies GetInfo only fires on NotRunning → Running transitions and stays stable while already Running

becomeAvailable() was calling syncDataServerInfo() on every successful health check (~every 2s), even when the server was already Running. This caused continuous redundant GetInfo RPCs and log noise. Now GetInfo is only called when transitioning from NotRunning to Running. Closes #932 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds TestDataServerController_GetInfoOnlyCalledOnStateTransition which verifies GetInfo is not called repeatedly while the server is already Running, and is only triggered on NotRunning -> Running transitions. Also adds initial syncDataServerInfo() call in the constructor since the controller starts in Running state and the fix skips GetInfo for already-Running servers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move syncDataServerInfo inside the NotRunning transition path directly and early-return when status is already Running, removing the wasNotRunning flag variable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…utine Remove the separate initial-info-sync goroutine. Instead, start the controller with NotRunning status so the first successful health check naturally triggers the NotRunning -> Running transition and calls syncDataServerInfo(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

becomeAvailable() was resetting healthCheckBackoff and healthWatchBackoff which are owned by other goroutines (healthPingWithRetries and healthWatchWithRetries respectively). This caused a data race when both goroutines triggered becomeAvailable() concurrently on the first health check. The resets are redundant since backoff.RetryNotify already calls Reset() internally at the start of each retry cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The healthCheckBackoff and healthWatchBackoff fields were shared across goroutines: becomeAvailable() reset them while healthPingWithRetries and healthWatchWithRetries used them concurrently via backoff.RetryNotify. Fix by creating backoff objects locally inside each retry function so each goroutine owns its own backoff instance. Also fix test timing: wait for GetInfo count to increase rather than just status change, since syncDataServerInfo runs after status is set to Running. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add ConcurrentBackOff, a mutex-wrapped backoff.BackOff, to make it safe for concurrent use. Both healthPingWithRetries and healthWatchWithRetries now share a single ConcurrentBackOff instance, allowing becomeAvailable() to safely reset it from either goroutine without data races. This replaces the previous approach of creating separate local backoff objects per goroutine, which lost the ability to reset both on recovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
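
A rough sketch of the mutex-wrapping idea, assuming a two-method backoff interface (`NextBackOff` and `Reset`). The `BackOff` interface below is a local stand-in declared so the example compiles without third-party dependencies, and `doublingBackOff` is a hypothetical policy invented for the demo. The actual ConcurrentBackOff in this PR may differ in detail.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// BackOff is a local stand-in with the two methods the wrapper needs.
type BackOff interface {
	NextBackOff() time.Duration
	Reset()
}

// doublingBackOff is a hypothetical policy that is NOT safe for
// concurrent use on its own.
type doublingBackOff struct {
	initial, max, current time.Duration
}

func newDoublingBackOff() *doublingBackOff {
	return &doublingBackOff{
		initial: 100 * time.Millisecond,
		max:     5 * time.Second,
		current: 100 * time.Millisecond,
	}
}

// NextBackOff returns the current delay and doubles it, capped at max.
func (b *doublingBackOff) NextBackOff() time.Duration {
	d := b.current
	b.current *= 2
	if b.current > b.max {
		b.current = b.max
	}
	return d
}

// Reset returns the policy to its initial delay.
func (b *doublingBackOff) Reset() { b.current = b.initial }

// ConcurrentBackOff serializes every call to the wrapped policy.
type ConcurrentBackOff struct {
	mu       sync.Mutex
	delegate BackOff
}

func NewConcurrentBackOff(delegate BackOff) *ConcurrentBackOff {
	return &ConcurrentBackOff{delegate: delegate}
}

func (c *ConcurrentBackOff) NextBackOff() time.Duration {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.delegate.NextBackOff()
}

func (c *ConcurrentBackOff) Reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.delegate.Reset()
}

func main() {
	shared := NewConcurrentBackOff(newDoublingBackOff())

	// Two goroutines stand in for the ping and watch retry loops,
	// both advancing and resetting the same shared instance.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				shared.NextBackOff()
				shared.Reset()
			}
		}()
	}
	wg.Wait()
	fmt.Println("next delay after concurrent use:", shared.NextBackOff())
}
```

Running this sketch with `go run -race` should report no races, whereas sharing the bare `doublingBackOff` between the two goroutines would.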

mattisonchao

Self-review — ready for review.

Summary of changes:

Fix redundant GetInfo RPCs: becomeAvailable() now early-returns when status is already Running, so syncDataServerInfo() only runs on NotRunning → Running transitions. This eliminates the ~2s polling of GetInfo on every health check.
Start with NotRunning: Controller initial status changed from Running to NotRunning, so the first health check naturally triggers the transition and initial GetInfo call — no extra init goroutine needed.
Fix data race on backoff: Introduced ConcurrentBackOff (mutex-wrapped backoff.BackOff) shared by both healthPingWithRetries and healthWatchWithRetries. This fixes a latent race where becomeAvailable() reset backoff objects concurrently with backoff.RetryNotify — previously hidden because the controller started as Running and never entered the reset path on first health check.
Test: TestDataServerController_GetInfoOnlyCalledOnStateTransition verifies GetInfo count stays stable while Running and only increases on state transitions.

All tests pass with -race.

The 5s timeout was too tight for CI runners — the test needs health check (~2s), status transition, assignment dispatch, and gRPC response to all complete before the leader hint is available on a non-leader node. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

With the NotRunning initial status, the coordinator needs health checks to pass before nodes are considered available. Under CI load with race detector, the existing 10s timeouts are too tight. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The health ping goroutine previously waited 2s (ticker interval) before the first check. With NotRunning initial status, this added unnecessary startup delay. Now does an immediate check before entering the ticker loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
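
The "check first, then tick" pattern can be sketched as follows. This is an illustrative outline under simplifying assumptions: `runHealthPings` and `checksBeforeFirstTick` are hypothetical names, and the real health ping also involves retries and backoff that are omitted here.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runHealthPings calls check once immediately and then once per tick,
// until ctx is cancelled.
func runHealthPings(ctx context.Context, interval time.Duration, check func()) {
	check() // immediate first check: no initial wait for the ticker

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			check()
		}
	}
}

// checksBeforeFirstTick runs the loop with an already-cancelled context
// and a very long interval, so only the immediate check can execute.
func checksBeforeFirstTick() int {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	calls := 0
	runHealthPings(ctx, time.Hour, func() { calls++ })
	return calls
}

func main() {
	fmt.Println("checks before the first tick:", checksBeforeFirstTick())
}
```

With a ticker-only loop the count before the first tick would be zero; with the immediate check it is one, which is what removes the initial interval-long delay.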

mattisonchao requested review from RobertIndie, coderzc and merlimat as code owners March 8, 2026 14:35

mattisonchao and others added 4 commits March 8, 2026 22:39

refactor: simplify becomeAvailable with early return instead of flag

58d2f0c

Move syncDataServerInfo inside the NotRunning transition path directly and early-return when status is already Running, removing the wasNotRunning flag variable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'main' into fix/redundant-getinfo-on-health-check

b97e364

mattisonchao self-assigned this Mar 8, 2026

mattisonchao and others added 3 commits March 8, 2026 23:17

mattisonchao commented Mar 8, 2026

mattisonchao mentioned this pull request Mar 8, 2026

Flaky tests: TestLeaderHintWithoutClient and TestLeaderBalancedNodeAdded #936

Closed

merlimat approved these changes Mar 8, 2026

mattisonchao and others added 3 commits March 9, 2026 00:06

mattisonchao merged commit 4db3949 into main Mar 8, 2026
9 checks passed

mattisonchao deleted the fix/redundant-getinfo-on-health-check branch March 8, 2026 18:03

mattisonchao merged 11 commits into main from fix/redundant-getinfo-on-health-check

2 participants
