Skip to content

Add lightweight cluster PING/PONG heartbeats with full-refresh safeguards#5

Closed
hpatro wants to merge 2 commits into
unstablefrom
codex/execute-plan-and-ensure-tests-pass
Closed

Add lightweight cluster PING/PONG heartbeats with full-refresh safeguards#5
hpatro wants to merge 2 commits into
unstablefrom
codex/execute-plan-and-ensure-tests-pass

Conversation

@hpatro

@hpatro hpatro commented Mar 20, 2026

Copy link
Copy Markdown
Owner

Motivation

  • Reduce steady-state cluster-bus bandwidth by using a lightweight PING/PONG header for routine failure detection while preserving existing safety semantics.
  • Avoid regressions where topology/metadata and failover-sensitive flags (e.g. NOFAILOVER) become slow to propagate under light heartbeats.
  • Ensure recovered-node failure-report cleanup remains reliable by retransmitting clear signals until stale reports can no longer affect quorum decisions.

Description

  • Add light-heartbeat support guarded by a new capability bit CLUSTER_NODE_LIGHT_HDR_PING_SUPPORTED and per-node bookkeeping fields fail_report_clear_time and last_full_heartbeat_sent in clusterNode (see src/cluster_legacy.h).
  • Split gossip processing into clusterProcessGossipEntries() to allow both full clusterMsg and light clusterMsgLight handling, and add clusterMsgLight count helpers to validate/process light packets (src/cluster_legacy.c).
  • Implement clusterSendPingWithOptions() and clusterShouldSendFullHeartbeat() to choose between light and full heartbeats; light heartbeats carry compact failure gossip and retransmitted clear entries while full heartbeats still include topology, extensions and metadata (src/cluster_legacy.c).
  • Preserve fast propagation for header-only metadata by forcing immediate full broadcasts from clusterUpdateMyself*/update helpers (use CLUSTER_TODO_BROADCAST_ALL) and ensure periodic per-peer full heartbeats at the configured direct-ping interval (cluster_ping_interval or cluster_node_timeout/2).
  • Keep recovery-clear behavior deterministic by retransmitting recovered-node "clear" gossip until CLUSTER_FAIL_REPORT_VALIDITY_MULT * cluster_node_timeout expires, preventing stale external reports from causing false FAIL decisions (src/cluster_legacy.c).
  • Add unit test coverage and small test adjustments: extend tests/unit/cluster/failure-marking.tcl with a retransmit/regression scenario and adjust tests/unit/cluster/cross-version-cluster.tcl to reflect mixed-version expectations.

Testing

  • Built the server successfully with make -C src valkey-server -j4 (binary built).
  • Ran ./runtest --single unit/cluster/failure-marking --clients 1 and the suite completed with all tests passing (regressions for recovery-clear addressed by the implementation).
  • Ran a set of related cluster unit tests: unit/cluster/no-failover-option, unit/cluster/hostnames, unit/cluster/announce-client-ip, unit/cluster/announce-client-ports, and unit/cluster/announced-endpoints, and all executed tests passed.
  • Ran ./runtest --single unit/cluster/cross-version-cluster --clients 1; mixed-version scenarios that require an external older binary were skipped in this environment (these tests are preserved and exercise fallback/compatibility when run with --other-server-path).

Codex Task

Comment thread src/cluster_legacy.c Dismissed
@hpatro hpatro closed this Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants