Add cluster shards support. #10293
Conversation
src/cluster.c
Outdated
void *shard_replylen = addReplyDeferredLen(c);
int shard_cnt = 0;
dict *nodeToSlotPair = dictCreate(&clusterNodesToSlotDictType);
for (int i = 0; i <= CLUSTER_SLOTS; i++) {
This exact slot enumeration logic has already been used twice in cluster.c. I wonder if it helps to single source it with a callback pattern that takes two parameters: the start index and the length. More specifically, I am thinking that the code block between line 5080 and 5094, inclusive, can be moved into a callback function while the rest of the for loop can be extracted into a generic enumeration function.
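For illustration, a rough sketch of the callback idea (the names clusterEnumerateSlotRanges and slotRangeCallback are invented for this example, not taken from the PR): a single generic walker scans the slot table and invokes the callback once per contiguous range owned by a node, passing the start index and the length.

```
typedef void (*slotRangeCallback)(clusterNode *node, int start, int len, void *privdata);

void clusterEnumerateSlotRanges(slotRangeCallback cb, void *privdata) {
    clusterNode *prev = NULL;
    int start = 0;
    for (int i = 0; i <= CLUSTER_SLOTS; i++) {
        clusterNode *n = (i < CLUSTER_SLOTS) ? server.cluster->slots[i] : NULL;
        if (n != prev) {
            /* Close the previous range; unassigned stretches (prev == NULL) are skipped. */
            if (prev != NULL) cb(prev, start, i - start, privdata);
            start = i;
            prev = n;
        }
    }
}
```

Each caller (CLUSTER SLOTS, CLUSTER NODES, CLUSTER SHARDS) would then only supply the callback body that formats the range.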
I've thought about how to deduplicate this slot range enumeration logic too. My idea is to create an iterator function which returns each slot range of a node:
int start = 0, end = 0;
while (nextSlotRange(node, &start, &end)) {
// do something for each range
}
I have an implementation. I can post it tomorrow.
Worth mentioning we have two types of iterations:
- We do a straight iteration through all of the slots and group them by slot ranges (see CLUSTER SLOTS).
- We want a result which is the ordered list of nodes, and then we want to iterate through those nodes' slots (see CLUSTER NODES).
Need to make sure the iterator supports both, or at least keeps case 2 easily attainable.
@madolson Good point. The version posted below only iterates over a single node (i.e. it covers 2 but not 1) but I guess it can be adapted, or possibly 1 is different enough to have a separate iteration function for that.
int nextSlotRange(clusterNode *node, int *begin, int *end);
/* Iterate over the slot ranges of a node. Use like this:
*
* int start = 0, end = 0;
* while (nextSlotRange(node, &start, &end)) {
* // This node has range start..end.
* }
*/
int nextSlotRange(clusterNode *node, int *begin, int *end) {
int j;
int start = -1;
for (j = (*end == 0) ? 0 : *end + 1; j < CLUSTER_SLOTS; j++) {
int bit = clusterNodeGetSlotBit(node, j);
if (bit && start == -1) start = j;
if (start != -1 && (!bit || j == CLUSTER_SLOTS-1)) {
if (bit && j == CLUSTER_SLOTS-1) j++;
*begin = start;
*end = j - 1;
return 1;
}
}
return 0;
}
@zuiderkwast I have added a conditional check: if the node provided is null, we move on to the next slot. WDYT?
int nextSlotRange(clusterNode *node, int *begin, int *end) {
int j;
int start = -1;
if (node == NULL) {
*begin = *begin + 1;
*end = *begin;
if (*end == CLUSTER_SLOTS) {
return 0;
} else {
return 1;
}
}
for (j = (*end == 0) ? 0 : *end + 1; j < CLUSTER_SLOTS; j++) {
int bit = clusterNodeGetSlotBit(node, j);
if (bit && start == -1) start = j;
if (start != -1 && (!bit || j == CLUSTER_SLOTS - 1)) {
if (bit && j == CLUSTER_SLOTS - 1) j++;
*begin = start;
*end = j - 1;
return 1;
}
}
return 0;
}
int start = 0, end = 0;
clusterNode *n = server.cluster->slots[start];
while (nextSlotRange(n, &start, &end)) {
if (n != NULL) {
sds name = sdsnewlen(n->name, CLUSTER_NAMELEN);
dictEntry *de = dictFind(nodeToSlotPair, name);
sdsfree(name);
list *slotPair;
if (de == NULL) {
slotPair = listCreate();
dictAdd(nodeToSlotPair, sdsnewlen(n->name, CLUSTER_NAMELEN), slotPair);
} else {
slotPair = dictGetVal(de);
}
if (start != end) {
listAddNodeTail(slotPair, sdscatfmt(sdsempty(), "%i-%i", start, end));
} else {
listAddNodeTail(slotPair, sdscatfmt(sdsempty(), "%i", start));
}
}
if ((end + 1) != CLUSTER_SLOTS) {
n = server.cluster->slots[end + 1];
}
}
@hpatro if n == NULL, the function returns a range with a single slot for no node? I think it is strange. It relies on the caller to make sure that begin + 1 has no node. Maybe you meant *begin = *end + 1? I think the next range starts where the previous range ends. Or did you change this?
Anyway, this function cannot check this slot has no node, so I think it is strange. Better check if n == NULL and find node with a slot before calling the iterator function.
Maybe we need another function nextSlotRangeAndNode(*begin, *end, *node) which also returns the node of the next range...? This logic is different and it is less complex to make them separate functions IMO.
int start = 0, end = 0;
clusterNode *n;
while (nextSlotRangeAndNode(&start, &end, &n)) {
if (n == NULL) continue;
// use node and range
}
Implementation (untested):
/* Finds the next slot range and its node, starting at end + 1. The returned node
* is NULL if the slot range is unassigned. */
int nextSlotRangeAndNode(int *begin, int *end, clusterNode **node) {
int start = (*end == 0) ? 0 : *end + 1;
if (start >= CLUSTER_SLOTS)
return 0;
clusterNode *n = server.cluster->slots[start];
int j = start;
while (j < CLUSTER_SLOTS && server.cluster->slots[j] == n)
j++;
*begin = start;
*end = j - 1;
*node = n;
return 1;
}
One question: Why do we need to return the slots in ascending order? Can't we just use the order of the master nodes?
For now, I'm not picking this up. I have shared the code between CLUSTER NODES and CLUSTER SHARDS. Maybe we can come back later to reduce the redundancy for CLUSTER SLOTS.
Probably fine, @zuiderkwast maybe make an issue for improvement?
I think the slot ranges should be represented as two integers (as @madolson suggested in the issue) rather than a string of the form "start-end". This way we use RESP for all structure and no extra parsing of a string is needed.
@zuiderkwast Actually Madelyn and I discussed this offline and thought that with an extremely fragmented slot distribution, this would be cheaper for network out and has minimal parsing overhead for the client.
I think @zuiderkwast has a good point. Dealing with exceptions always adds complexity. Based on my quick back-of-the-envelope calculation, I think we could achieve an overall similar footprint (if not better) using the original proposal with one trick, which is to encode the length of the slot range as opposed to the end slot number. Here is an example:
`$9\r\n1000-1000\r\n`: the total length is 15 bytes.
`*2\r\n:1000\r\n:1\r\n`: the total length is also 15 bytes.
Note that a 4-digit slot number is where the two encoding schemes result in the same footprint. When the slot number is greater than 9999, the second encoding scheme yields a smaller footprint, while when the slot number is smaller than 1000, the first encoding scheme yields a smaller footprint. Therefore, in the most extreme case when no two slots are next to each other in the same shard, the second encoding scheme would actually yield an overall smaller footprint while remaining friendly to generic RESP parsing.
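To make the crossover concrete, here are the byte counts for a 3-digit and a 5-digit single-slot range (my own arithmetic, not taken from the thread):

```
slot 100:    $7\r\n100-100\r\n       = 13 bytes    vs    *2\r\n:100\r\n:1\r\n     = 14 bytes
slot 10000:  $11\r\n10000-10000\r\n  = 18 bytes    vs    *2\r\n:10000\r\n:1\r\n   = 16 bytes
```

Below four digits the dash string wins, above four digits the (start, length) pair wins, consistent with the estimate above.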
Edit: After looking at the code some more, it seems inconsistent to have different mechanisms return different values. The intention is that this is a "replacement" of sorts for cluster nodes, which actually gives slots in the form "a-b" or "a", so I think it's actually reasonable to re-use the code that generates the nodes output.
@madolson did you actually mean a replacement for CLUSTER SLOTS? I thought one of the issues called out in #10168 was about the CLUSTER NODES output not being RESP compliant. Also, the output format implemented here is modeled more after CLUSTER SLOTS than CLUSTER NODES. That said, the use of "length" instead of the "end slot number" does break away from the CLUSTER SLOTS norm. We are probably looking at an additional ~55KB (6.4K*4 + 9K*3 + 0.9K*2) overhead in the most fragmented case if we were to go with full RESP compatibility. Assuming each node takes up 300B in the output, a 100-node cluster, and 15B per slot range with the "-" representation, the total output size is about 300B*100 + 15B*16K = 270KB. 55KB is about a 20% increase, but again this is an extreme case. So here is where I see the tough call lies:
In case a node owns a single slot, the network output would be something like this for the dash representation:
I think the difference of 5-10 bytes per slot range is not very important, even if the difference is 50K. The real difference is when compared to CLUSTER SLOTS, where the complete nodes are repeated for each slot range. Assuming a maximally fragmented cluster with one master and two replicas per shard, assuming 100B per node info, CLUSTER SLOTS is 3*100*16K = 4.8M. So, even if CLUSTER SHARDS is 100K, we save 98% compared to CLUSTER SLOTS. Among the options we have:
I think 1 and 2 are acceptable, but 3 is rather obscure.
src/cluster.c
Outdated
int start = -1;
void *shard_replylen = addReplyDeferredLen(c);
int shard_cnt = 0;
dict *nodeToSlotPair = dictCreate(&clusterNodesToSlotDictType);
Creating a temporary dict seems unnecessary. Maybe we can avoid it?
I think it's cheaper to iterate over the structures and add replies directly without allocating any extra structures, even if we might need to iterate over the nodes one extra time, e.g. once just to count the shards for addReplyArrayLen(c, numshards).
If we need to create a structure like this one, then perhaps it can be stored permanently or maybe we can change the existing structures to be more optimized for CLUSTER SHARDS (if possible).
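For illustration, a minimal sketch of the "count first, then reply" idea (the one-shard-per-master assumption is a simplification for this example, not the PR's exact code):

```
/* First pass: count shards so the array length can be emitted up front,
 * with no temporary dict. */
int numshards = 0;
dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
dictEntry *de;
while ((de = dictNext(di)) != NULL) {
    clusterNode *n = dictGetVal(de);
    if (nodeIsMaster(n)) numshards++; /* replicas are folded into their master's shard */
}
dictReleaseIterator(di);
addReplyArrayLen(c, numshards);

/* Second pass: iterate the masters again and emit each shard's "slots" and
 * "nodes" entries directly. */
```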
I think we could re-use the iterator logic for CLUSTER NODES that loops through all the slots and caches the response on each node. We can then re-iterate through the cached nodes to add the output.
+1
Agreed and 2) would be my preference.
@zuiderkwast / @madolson Do you guys have any preference?
I'm fine with 1 or 2. I have a slight preference towards option 2 because it's the simplest for clients to implement. @PingXie I did mean cluster nodes. Anyone that didn't like the cluster slots output switched to using cluster nodes.
yossigo
left a comment
I haven't looked at the details yet, will do shortly, but I want to point out a potential ambiguity using the term "shard".
It has not been used by Redis Cluster until now, and in other contexts it has been used either as a synonym for "cluster node" or (as used in this PR) to describe a master and a set of replicas sharing the same slots. To avoid confusion, I think we should avoid using it here.
@yossigo With the new pubsub feature we decided to use "Sharded PubSub" as the terminology to describe a master and set of replicas sharing the same slots. Shall we continue using the same definition here?
@hpatro I think "sharded" and "sharding" are not a problem; my only concern is the current ambiguity around "shard" and whether it refers to a single cluster node or a set of nodes with the same hash slots (master + replicas).
@yossigo I think the term shard in this sense is very convenient. The result of this command is an array of info per shard. Do you prefer TOPOLOGY, or do you have other suggestions? Then we can try to compare pros and cons... Another possible term is "slice". The term "network slicing" is used in telecom with a similar meaning.
The notion of "shard" was probably unofficially introduced already based on the documentation:
I feel that it is very natural to go ahead and make it official. Besides, "shard" is a well-established term in distributed systems and I don't see the use of "shard" in this context diverging from the norm. The other (less popular) term that I have seen is "partition", but that is quite foreign to Redis, to the best of my knowledge. I have not seen "slice" used in distributed storage/database systems.
While we are here, can we also consider having the node's epoch included in the output as well? The client application can bounce between nodes at run time, so there are times it'd receive conflicting "shard" topologies. An embedded epoch value would go a long way to help the client application resolve the collisions. @zuiderkwast this is somewhat related to our discussion on your topology change pubsub proposal #10150.
Sorry for taking so long to weigh in, but I just also wanted to voice that I think it's good to make the term "shard" more official. The only downside I can think of is that it will become more complex if we ever want to introduce more flexible configurations. For example, consider the scenario where we wanted to allow masters to be replicas for other masters, so that on node failure we have the data replicated but all nodes are able to take writes.
@madolson trying to get a better understanding of "make config generations include all of the state information". Are you talking about adding the remaining states like
Sure. I don't have a strong opinion here and I haven't seen updates to #10195 either. Agreed we can revisit this later when there is a real use case.
Forgot to mention that I think we'd also need my changes (#10163) to ensure that failed nodes are included after a replica takes over the primary-ship in the shard. I can port it over after this PR is merged.
@PingXie Include other metadata such as which replicas are serving data from which masters, updates for announced hostnames and IP addresses, any other future config state changes.
Added documentation: redis/redis-doc#1815
Thanks @madolson. I would like to clarify that there is a legit and real scenario for having epochs in |
oranagra
left a comment
Approved in a core-team meeting. Please merge away (and make sure the docs are updated too).
set node_0_id [R 0 CLUSTER MYID]

test "Kill a node and tell the replica to immediately takeover" {
This fails in my GH cluster test
https://github.com/enjoy-binbin/redis/runs/5563545978?check_suite_focus=true#step:9:681
02:18:22> Kill a node and tell the replica to immediately takeover: FAILED: caught an error in the test Port 30000 does not return available after killing instance.
Port 30000 does not return available after killing instance.
Will take a look here.
Bane of my existence, for some reason this test only fails when you run the entire suite but passes individually. I'm committing to deprecating this test suite now.
int tls_port = server.cluster_announce_tls_port ? server.cluster_announce_tls_port : server.tls_port;
if (tls_port) {
    addReplyBulkCString(c, "tls-port");
    addReplyLongLong(c, tls_port);
please take a look here (tls_port and the above port in tls mode)
the cluster-shards test failed in tls. (https://github.com/redis/redis/runs/5578694306?check_suite_focus=true#step:9:682)
with this patch, the test passes, but I'm not very familiar with tls mode:
int port = server.cluster_announce_port ? server.cluster_announce_port : server.port;
int tls_port = server.cluster_announce_tls_port ? server.cluster_announce_tls_port : server.tls_port;
if (port) {
addReplyBulkCString(c, "port");
if (tls_port)
addReplyLongLong(c, node->pport);
else
addReplyLongLong(c, node->port);
reply_count++;
}
if (tls_port) {
addReplyBulkCString(c, "tls-port");
addReplyLongLong(c, node->port);
reply_count++;
}
in case anyone will fix it, i would like to mention:
- since we deprecated cluster-slots, maybe we need to add a doc_flags in cluster-slots.json like other commands
- i notice cluster-shards.json is indented by two spaces, we use four spaces elsewhere
- the three `wait_for_condition 50 100` in 28-cluster-shards.tcl may have a timing issue, at least I've encountered it locally (centos7), so maybe change it to 500 100
For 1, we marked it as deprecated. Was there something else that you meant?
i mean the doc_flags, like other commands?
"deprecated_since": "6.2.0",
"replaced_by": "`GEOSEARCH` and `GEOSEARCHSTORE` with the `BYRADIUS` argument",
"doc_flags": [
"DEPRECATED"
],
Oh, I guess I didn't realize that was a thing.
@hpatro Do you have time to address this? Otherwise I can take a look tomorrow.
@enjoy-binbin Regarding the port and tls-port logic: server.tls_port is always a TLS port and server.port is always a plaintext (AKA cleartext) port if specified. For the cluster bus, it depends on the server.tls_cluster variable:
if (server.tls_cluster) {
// node->port is TLS port (always set)
// node->pport is plaintext port, optional (if non-zero)
} else {
// node->port is plaintext port (always set)
// There is no TLS port
}
There are a few places that use a hard-coded const of 128 to allocate a buffer for d2string. Replace these with a clear macro. Note that, in theory, converting a double into a string could take as much as nearly 400 chars, but since d2string uses `%g` and not `%f`, it won't exceed some 40 chars. Unrelated: restore some changes to the auto-generated commands.c that got accidentally reverted in #10293.
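For reference, a minimal sketch of the macro replacement that commit describes (the macro name here is an assumption; the commit's actual identifier may differ):

```
/* Hypothetical macro name; 128 is comfortably larger than the ~40 chars %g can produce. */
#define MAX_D2STRING_CHARS 128

char buf[MAX_D2STRING_CHARS];
d2string(buf, sizeof(buf), value); /* instead of a bare hard-coded 128 at each call site */
```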
This commit improves malloc efficiency of the slots_info_pairs mechanism in cluster.c by changing the adlist into an array that is realloc'd with a greedy growth mechanism. Recently the cluster tests have been consistently failing when executed with ASAN in the CI. I tried to track down the commit that started it, and it appears to be #10293. Looking at the commit, I realized it didn't affect this test / flow, other than the replacement of the slots_info_pairs from sds to list. I concluded that what could be happening is that the slot range is very fragmented, and that results in many allocations. With sds, it results in one allocation, and we also have a greedy growth mechanism, but with adlist we just have many many small allocations. This probably causes stress on ASAN and causes it to be slow at termination.
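A rough sketch of the greedy-growth idea described in that commit (the struct and helper names are invented for illustration, not the commit's actual identifiers):

```
/* Flat array of slot numbers (appended as consecutive start/end pairs), grown
 * greedily so a heavily fragmented slot map costs O(log n) reallocs instead of
 * one allocation per adlist node. */
typedef struct {
    uint16_t *pairs;   /* slot numbers, stored as consecutive (start, end) pairs */
    int count;         /* entries used */
    int capacity;      /* entries allocated */
} slotInfoPairs;

static void slotInfoPairsAppend(slotInfoPairs *sp, uint16_t slot) {
    if (sp->count == sp->capacity) {
        sp->capacity = (sp->capacity == 0) ? 8 : sp->capacity * 2; /* greedy growth */
        sp->pairs = zrealloc(sp->pairs, sp->capacity * sizeof(uint16_t));
    }
    sp->pairs[sp->count++] = slot;
}
```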
while((ln = listNext(&li))) {
    addReplyBulkLongLong(c, (unsigned long)listNodeValue(ln));
    ln = listNext(&li);
    addReplyBulkLongLong(c, (unsigned long)listNodeValue(ln));
the slots here are not integers, see #10680
inconsistent with the top comment and the CLUSTER SHARDS docs:
127.0.0.1:5001> cluster shards
1) 1) "slots"
2) 1) "0"
2) "16383"
It used to return slots as strings, like:
```
redis> cluster shards
1) 1) "slots"
2) 1) "10923"
2) "16383"
```
CLUSTER SHARDS docs and the top comment of #10293 says that it returns integers.
Note other commands like CLUSTER SLOTS, it returns slots as integers.
Use addReplyLongLong instead of addReplyBulkLongLong, now it returns slots as integers:
```
redis> cluster shards
1) 1) "slots"
2) 1) (integer) 10923
2) (integer) 16383
```
This is a small breaking change, introduced in 7.0.0 (7.0 RC3, #10293)
Fixes #10680
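In code terms, the fix amounts to switching the reply call in the loop quoted earlier (this mirrors that snippet; the surrounding code may have changed since):

```
while ((ln = listNext(&li))) {
    /* Emit the slot number as a RESP integer instead of a bulk string. */
    addReplyLongLong(c, (unsigned long)listNodeValue(ln));
    ln = listNext(&li);
    addReplyLongLong(c, (unsigned long)listNodeValue(ln));
}
```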
Ref #10168
Implement a new CLUSTER SHARDS command which provides more information to clients in an extensible format. This is intended to be an alternative to CLUSTER SLOTS that provides more information and is more efficient for clusters with disjoint slot assignment. It is also to help migrate some clients currently using CLUSTER NODES. The CLUSTER SHARDS command returns a map of "shards", where a shard is a collection of nodes serving the same set of slots. If a node is serving no slots, it is said to exist in its own shard. Each shard contains a dictionary with two properties, "slots" and "nodes". Slots are a list of integer pairs which indicate the start and stop of slot ranges. Nodes is a dictionary that contains the following attributes:
(Note, I think endpoint is missing, also replication-offset is missing from masters)
Sample request/response:
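For orientation only, an invented example of the reply shape described above (all values are made up, and the exact field set may differ; see the note about endpoint and replication-offset):

```
127.0.0.1:6379> CLUSTER SHARDS
1) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5460
   3) "nodes"
   4) 1)  1) "id"
          2) "e10b7051d6bf2d5febd39a2be297bbaea6084111"
          3) "port"
          4) (integer) 6379
          5) "ip"
          6) "127.0.0.1"
          7) "role"
          8) "master"
          9) "replication-offset"
         10) (integer) 72156
         11) "health"
         12) "online"
```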
Open considerations