Make the number of healthy nodes in gossip section configurable#2746
Conversation
7b29e59 to
c5642e8
Compare
|
What are the theoretical implications of lowering the number? Sending 10 pings with n/10 gossips achieves the same information-spreading effect as sending 20 pings n/20 gossips? So failure detection and convenrgence of any changes slows down linearly with this config? I'm fine with a config like this, but I (and others, you included?) have a feeling we can gossip smarter without sacrificing anything. I really liked the idea of prioritizing gossips about node for which there was a recent change, this idea: #1897 (comment) Can we add a last-modified timetamp to each node and do a weighted random? Another idea is to increment a score for each node we gossip about and then prioritize the ones with lower score next time. |
I came across this idea in hashicorp's serf. It requires a bit of work to index node by different type and have logic around which ones to prioritise more. Quite achievable to have smarter gossip. This PR is a guardrail for the current system to avoid CPU/network spikes. |
We need to guarantee a message to be received directly or indirectly from another node within node-timeout/2 period. If that is met we don't send out another message. So, this might lead to more direct pings which has higher overhead. Gossip node information is 106 bytes. However, the entire payload is around 2200B. |
There was a problem hiding this comment.
I think this is pretty clean and we could merge in it for 9.1 to be able to tune and evaluate it over time, but are we sure we want to maintain this config going forward? We could say it's experimental and make it a hidden config to start with.
Can you mention in the top comment the theoretical implications of for example lowering the value to 5%? Is it like: If we didn't get gossip about a node during a node timeout, we ping it, so in the worst case we'll just send more pings? Doesn't this mean we can measure the cluster bus traffic and tune this value? When we've done that, we can change the default?
The failing test case is not affected by this change, right?
[err]: Multiple primary nodes are down, rank them based on the failed primary in tests/unit/cluster/failover2.tcl
No failover detected
Yeah, makes sense to move it to a hidden config.
Pretty spot on. Will add it above.
Taking a look. |
78c92ec to
1a62721
Compare
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
1a62721 to
40bc9f3
Compare
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
| * information would be broadcasted. */ | ||
| int pfail_wanted = server.cluster->stats_pfail_nodes; | ||
| if (pfail_wanted >= wanted) { | ||
| pfail_wanted = wanted; |
There was a problem hiding this comment.
I am not too sure about this but if pfail_wanted = wanted, is it possible that it will keep sending the same nodes again and again if the quorum is not reached to mark them as fail.
There was a problem hiding this comment.
Yeah the test failure seems to be pointing to that as well. We risk slower propagation of information. I'm trying to see how we can converge sooner.
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
|
The four failed for CI jobs all failed on this test case: Is it caused by this change? |
Yes, I'm suspicious of that. Let's decouple the |
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #2746 +/- ##
============================================
+ Coverage 74.98% 75.05% +0.07%
============================================
Files 129 129
Lines 71551 71632 +81
============================================
+ Hits 53654 53765 +111
+ Misses 17897 17867 -30
🚀 New features to boost your workflow:
|
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
…le (valkey-io#2746)" This reverts commit 0ff55de.
This PR introduces a new config `cluster-message-gossip-perc` which allows an operator to modify the amount of gossip node information to be sent per ping/pong/meet message. It can be modified dynamically (Related to #2291). The default value is 10% i.e. 10% of peer node information would be gossiped along with each ping/pong/meet packet. Users can tune this configuration, setting the value higher allows faster information dissemination whereas setting it lower would lead to direct PING messages if no information was received about a node with the `server.cluster_node_timeout/2` period. Note: the behavior for partially failed gossip nodes still remains intact where all the `pfail` nodes are part of the message for faster propagation of information and faster transition of `PFAIL` to `FAIL`. --------- Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
This PR introduces a new config
cluster-message-gossip-percwhich allows an operator to modify the amount of gossip node information to be sent per ping/pong/meet message. It can be modified dynamically (Related to #2291). The default value is 10% i.e. 10% of peer node information would be gossiped along with each ping/pong/meet packet. Users can tune this configuration, setting the value higher allows faster information dissemination whereas setting it lower would lead to direct PING messages if no information was received about a node with theserver.cluster_node_timeout/2period.Note: the behavior for partially failed gossip nodes still remains intact where all the
pfailnodes are part of the message for faster propagation of information and faster transition ofPFAILtoFAIL.