Skip to content

Make the number of healthy nodes in gossip section configurable#2746

Merged
hpatro merged 7 commits into
valkey-io:unstablefrom
hpatro:cluster_ping_gossip_count
Mar 9, 2026
Merged

Make the number of healthy nodes in gossip section configurable#2746
hpatro merged 7 commits into
valkey-io:unstablefrom
hpatro:cluster_ping_gossip_count

Conversation

@hpatro

@hpatro hpatro commented Oct 17, 2025

Copy link
Copy Markdown
Contributor

This PR introduces a new config cluster-message-gossip-perc which allows an operator to modify the amount of gossip node information to be sent per ping/pong/meet message. It can be modified dynamically (Related to #2291). The default value is 10% i.e. 10% of peer node information would be gossiped along with each ping/pong/meet packet. Users can tune this configuration, setting the value higher allows faster information dissemination whereas setting it lower would lead to direct PING messages if no information was received about a node with the server.cluster_node_timeout/2 period.

Note: the behavior for partially failed gossip nodes still remains intact where all the pfail nodes are part of the message for faster propagation of information and faster transition of PFAIL to FAIL.

@hpatro hpatro requested a review from madolson October 17, 2025 18:29
@hpatro hpatro force-pushed the cluster_ping_gossip_count branch 2 times, most recently from 7b29e59 to c5642e8 Compare October 17, 2025 18:40
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
@zuiderkwast

Copy link
Copy Markdown
Contributor

What are the theoretical implications of lowering the number?

Sending 10 pings with n/10 gossips achieves the same information-spreading effect as sending 20 pings n/20 gossips? So failure detection and convenrgence of any changes slows down linearly with this config?

I'm fine with a config like this, but I (and others, you included?) have a feeling we can gossip smarter without sacrificing anything.

I really liked the idea of prioritizing gossips about node for which there was a recent change, this idea: #1897 (comment)

Can we add a last-modified timetamp to each node and do a weighted random?

Another idea is to increment a score for each node we gossip about and then prioritize the ones with lower score next time.

@hpatro

hpatro commented Oct 17, 2025

Copy link
Copy Markdown
Contributor Author

I really liked the idea of prioritizing gossips about node for which there was a recent change, this idea: #1897 (comment)

I came across this idea in hashicorp's serf. It requires a bit of work to index node by different type and have logic around which ones to prioritise more. Quite achievable to have smarter gossip.

This PR is a guardrail for the current system to avoid CPU/network spikes.

@hpatro

hpatro commented Oct 18, 2025

Copy link
Copy Markdown
Contributor Author

What are the theoretical implications of lowering the number?

We need to guarantee a message to be received directly or indirectly from another node within node-timeout/2 period. If that is met we don't send out another message.

So, this might lead to more direct pings which has higher overhead. Gossip node information is 106 bytes. However, the entire payload is around 2200B.

Comment thread src/config.c Outdated

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty clean and we could merge in it for 9.1 to be able to tune and evaluate it over time, but are we sure we want to maintain this config going forward? We could say it's experimental and make it a hidden config to start with.

Can you mention in the top comment the theoretical implications of for example lowering the value to 5%? Is it like: If we didn't get gossip about a node during a node timeout, we ping it, so in the worst case we'll just send more pings? Doesn't this mean we can measure the cluster bus traffic and tune this value? When we've done that, we can change the default?

The failing test case is not affected by this change, right?

[err]: Multiple primary nodes are down, rank them based on the failed primary in tests/unit/cluster/failover2.tcl
No failover detected

Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c
@hpatro

hpatro commented Mar 5, 2026

Copy link
Copy Markdown
Contributor Author

I think this is pretty clean and we could merge in it for 9.1 to be able to tune and evaluate it over time, but are we sure we want to maintain this config going forward? We could say it's experimental and make it a hidden config to start with.

Yeah, makes sense to move it to a hidden config.

Can you mention in the top comment the theoretical implications of for example lowering the value to 5%? Is it like: If we didn't get gossip about a node during a node timeout, we ping it, so in the worst case we'll just send more pings? Doesn't this mean we can measure the cluster bus traffic and tune this value? When we've done that, we can change the default?

Pretty spot on. Will add it above.

The failing test case is not affected by this change, right?

[err]: Multiple primary nodes are down, rank them based on the failed primary in tests/unit/cluster/failover2.tcl
No failover detected

Taking a look.

@hpatro hpatro force-pushed the cluster_ping_gossip_count branch 3 times, most recently from 78c92ec to 1a62721 Compare March 5, 2026 22:01
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
@hpatro hpatro force-pushed the cluster_ping_gossip_count branch from 1a62721 to 40bc9f3 Compare March 6, 2026 05:40
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Comment thread valkey.conf Outdated
Comment thread src/config.c Outdated
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Comment thread src/cluster_legacy.c Outdated
* information would be broadcasted. */
int pfail_wanted = server.cluster->stats_pfail_nodes;
if (pfail_wanted >= wanted) {
pfail_wanted = wanted;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not too sure about this but if pfail_wanted = wanted, is it possible that it will keep sending the same nodes again and again if the quorum is not reached to mark them as fail.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the test failure seems to be pointing to that as well. We risk slower propagation of information. I'm trying to see how we can converge sooner.

@zuiderkwast

Copy link
Copy Markdown
Contributor

The four failed for CI jobs all failed on this test case:

*** [err]: Multiple primary nodes are down, rank them based on the failed primary in tests/unit/cluster/failover2.tcl
No failover detected

Is it caused by this change?

@hpatro

hpatro commented Mar 6, 2026

Copy link
Copy Markdown
Contributor Author

The four failed for CI jobs all failed on this test case:

*** [err]: Multiple primary nodes are down, rank them based on the failed primary in tests/unit/cluster/failover2.tcl
No failover detected

Is it caused by this change?

Yes, I'm suspicious of that. Let's decouple the pfail nodes accounting as part of this change. I will tackle it separately. The reasoning is it slows down all the pfail node discovery and would rather do it in a batched manner. I've reverted to the default mechanism 83dc041.

Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
@hpatro hpatro changed the title Bound the number of nodes in gossip section Bound the number of healthy nodes in gossip section Mar 6, 2026
@hpatro hpatro changed the title Bound the number of healthy nodes in gossip section Make the number of healthy nodes in gossip section configurable Mar 6, 2026
@codecov

codecov Bot commented Mar 6, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.05%. Comparing base (8c397ea) to head (77eb3b1).
⚠️ Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2746      +/-   ##
============================================
+ Coverage     74.98%   75.05%   +0.07%     
============================================
  Files           129      129              
  Lines         71551    71632      +81     
============================================
+ Hits          53654    53765     +111     
+ Misses        17897    17867      -30     
Files with missing lines Coverage Δ
src/cluster_legacy.c 88.10% <ø> (+0.03%) ⬆️
src/config.c 78.56% <ø> (ø)
src/server.h 100.00% <ø> (ø)

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks good.

Comment thread tests/unit/cluster/packet.tcl Outdated
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
@hpatro hpatro merged commit 0ff55de into valkey-io:unstable Mar 9, 2026
71 of 72 checks passed
JimB123 added a commit to JimB123/valkey that referenced this pull request Mar 10, 2026
JimB123 pushed a commit that referenced this pull request Mar 19, 2026
This PR introduces a new config `cluster-message-gossip-perc` which
allows an operator to modify the amount of gossip node information to be
sent per ping/pong/meet message. It can be modified dynamically (Related
to #2291). The default value
is 10% i.e. 10% of peer node information would be gossiped along with
each ping/pong/meet packet. Users can tune this configuration, setting
the value higher allows faster information dissemination whereas setting
it lower would lead to direct PING messages if no information was
received about a node with the `server.cluster_node_timeout/2` period.

Note: the behavior for partially failed gossip nodes still remains
intact where all the `pfail` nodes are part of the message for faster
propagation of information and faster transition of `PFAIL` to `FAIL`.

---------

Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants