
Conversation

@hpatro
Contributor

@hpatro hpatro commented Feb 13, 2022

Ref #10168

Implement a new CLUSTER SHARDS command which provides more information to clients in an extensible format. This is intended to be an alternative to CLUSTER SLOTS that provides more information and is more efficient for clusters with disjoint slot assignments. It is also intended to help migrate some clients currently using CLUSTER NODES.

The CLUSTER SHARDS command returns a map of "shards", where a shard is a collection of nodes serving the same set of slots. If a node is serving no slots, it is said to exist in its own shard. Each shard contains a dictionary with two properties, "slots" and "nodes". Slots is a list of integer pairs, each indicating the start and end of a slot range. Nodes is a dictionary that contains the following attributes:

  • ID: The unique ID for this node.
  • port: The TCP port for this node, if available.
  • tls-port: The TLS port for this node, if available.
  • hostname: The propagated hostname for this node, if available.
  • endpoint: The preferred endpoint for the node. See https://redis.io/commands/cluster-slots for what the value of this represents.
  • role: Either master or replica. Please consider the alternative where we say the "first" node in the list is the master.
  • replication-offset: Reported for masters and replicas. This field can be used to read from the most up-to-date replica, or to check how far behind replicas are from the master.
  • health: The reported health of this node to clients, one of three values:
    • online: The node is online and healthy.
    • fail: The node is confirmed dead.
    • loading: The node is currently doing a full sync, so don't try to connect to it.

(Note: I think endpoint is missing from the sample below, and replication-offset is missing for masters)

Sample request/response:

127.0.0.1:6379> CLUSTER SHARDS
1) 1) "slots"
   2) 1) (integer) 10923
      2) (integer) 11110
      3) (integer) 11113
      4) (integer) 16111
      5) (integer) 16113
      6) (integer) 16383
   3) "nodes"
   4) 1)  1) "id"
          2) "71f058078c142a73b94767a4e78e9033d195dfb4"
          3) "port"
          4) (integer) 6381
          5) "ip"
          6) "127.0.0.1"
          7) "role"
          8) "primary"
          9) "health"
         10) "online"
      2)  1) "id"
          2) "1461967c62eab0e821ed54f2c98e594fccfd8736"
          3) "port"
          4) (integer) 7381
          5) "ip"
          6) "127.0.0.1"
          7) "role"
          8) "replica"
          9) "replication-offset"
         10) (integer) 0
         11) "health"
         12) "fail"
2) 1) "slots"
   2) 1) (integer) 5461
      2) (integer) 10922
   3) "nodes"
   4) 1)  1) "id"
          2) "9215e30cd4a71070088778080565de6ef75fd459"
          3) "port"
          4) (integer) 6380
          5) "ip"
          6) "127.0.0.1"
          7) "role"
          8) "primary"
          9) "health"
         10) "online"
      2)  1) "id"
          2) "877fa59da72cb902d0563d3d8def3437fc3a0196"
          3) "port"
          4) (integer) 7380
          5) "ip"
          6) "127.0.0.1"
          7) "role"
          8) "replica"
          9) "replication-offset"
         10) (integer) 0
         11) "health"
         12) "loading"
3) 1) "slots"
   2) 1) (integer) 0
      2) (integer) 5460
      3) (integer) 11111
      4) (integer) 11112
      5) (integer) 16112
      6) (integer) 16112
   3) "nodes"
   4) 1)  1) "id"
          2) "b7e9acc0def782aabe6b596f67f06c73c2ffff93"
          3) "port"
          4) (integer) 7379
          5) "ip"
          6) "127.0.0.1"
          7) "role"
          8) "primary"
          9) "health"
         10) "online"
      2)  1) "id"
          2) "e2acf1a97c055fd09dcc2c0dcc62b19a6905dbc8"
          3) "port"
          4) (integer) 6379
          5) "ip"
          6) "127.0.0.1"
          7) "hostname"
          8) "host-1.com"
          9) "role"
         10) "replica"
         11) "replication-offset"
         12) (integer) 0
         13) "health"
         14) "loading"

Open considerations

  • Should we expose the cluster epoch in some capacity?
    • Pros: Clients will be able to compare multiple different slot responses, and compare epochs. (Omitted for now, we can add it later)
    • Cons: Epochs are not coherent except for mastership membership, so not a great indicator of anything.
  • Should we also expose PFAIL (or timed out) as a health indicator? (Omitted, since it's not a reliable indicator of node health and we don't want clients to use it)
    • Pros: Some clients want to be able to know a node has timed out and might be in fail soon. This may also be useful in split-network situations, but I'm not clear on that use case.
    • Cons: The time delta between pfail and fail for a hard-down node is pretty small, since it will start gossiping the failure immediately. Clients might implement extra logic to handle this case.
  • Code changes
  • Test cases for slot/slotrange/nodes/roles/ip/port/node_name
  • Test cases for health as LOADING
  • Test cases for endpoint

src/cluster.c Outdated
void *shard_replylen = addReplyDeferredLen(c);
int shard_cnt = 0;
dict *nodeToSlotPair = dictCreate(&clusterNodesToSlotDictType);
for (int i = 0; i <= CLUSTER_SLOTS; i++) {
Contributor

This exact slot enumeration logic has already been used twice in cluster.c. I wonder if it would help to single-source it with a callback pattern that takes two parameters: the start index and the length. More specifically, I am thinking that the code block between line 5080 and 5094, inclusive, could be moved into a callback function, while the rest of the for loop could be extracted into a generic enumeration function.
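
(A minimal sketch of the suggested callback pattern; forEachSlotRange and slotRangeCallback are illustrative names and an assumption, not code from the actual patch:)

typedef void (*slotRangeCallback)(clusterNode *node, int start, int len, void *ctx);

/* Generic walker: finds each contiguous run of slots owned by the same node
 * and hands (start, length) to the caller-supplied callback. */
static void forEachSlotRange(slotRangeCallback cb, void *ctx) {
    clusterNode *prev = NULL;
    int start = 0;
    for (int i = 0; i <= CLUSTER_SLOTS; i++) {
        /* i == CLUSTER_SLOTS acts as a sentinel that flushes the last run. */
        clusterNode *n = (i < CLUSTER_SLOTS) ? server.cluster->slots[i] : NULL;
        if (n != prev) {
            if (prev != NULL) cb(prev, start, i - start, ctx);
            start = i;
            prev = n;
        }
    }
}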

Contributor

I've thought about how to deduplicate this slot range enumeration logic too. My idea is to create an iterator function which returns each slot range of a node:

int start = 0, end = 0;
while (nextSlotRange(node, &start, &end)) {
   // do something for each range
}

I have an implementation. I can post it tomorrow.

Contributor

@madolson madolson Feb 14, 2022

Worth mentioning we have two types of iterations:

  1. We do a straight iteration through all of the slots and group them by slot ranges (see CLUSTER SLOTS)
  2. We want a result which is an ordered list of nodes, and then we want to iterate through those nodes' slots (see CLUSTER NODES).

We need to make sure the iterator supports both, or at least keeps 2 easily attainable.

Contributor

@madolson Good point. The version posted below only iterates over a single node (i.e. it covers 2 but not 1) but I guess it can be adapted, or possibly 1 is different enough to have a separate iteration function for that.

int nextSlotRange(clusterNode *node, int *begin, int *end);
/* Iterate over the slot ranges of a node. Use like this:
 *
 *     int start = 0, end = 0;
 *     while (nextSlotRange(node, &start, &end)) {
 *         // This node has range start..end.
 *     }
 */
int nextSlotRange(clusterNode *node, int *begin, int *end) {
    int j;
    int start = -1;
    for (j = (*end == 0) ? 0 : *end + 1; j < CLUSTER_SLOTS; j++) {
        int bit = clusterNodeGetSlotBit(node, j);
        if (bit && start == -1) start = j;
        if (start != -1 && (!bit || j == CLUSTER_SLOTS-1)) {
            if (bit && j == CLUSTER_SLOTS-1) j++;
            *begin = start;
            *end = j - 1;
            return 1;
        }
    }
    return 0;
}

Contributor Author

@zuiderkwast I have added a conditional check: if the node provided is null, we move on to the next slot. WDYT?

int nextSlotRange(clusterNode *node, int *begin, int *end) {
    int j;
    int start = -1;
    if (node == NULL) {
        *begin = *begin + 1;
        *end = *begin;
        if (*end == CLUSTER_SLOTS) {
            return 0;
        } else {
            return 1;
        }
    }
    for (j = (*end == 0) ? 0 : *end + 1; j < CLUSTER_SLOTS; j++) {
        int bit = clusterNodeGetSlotBit(node, j);
        if (bit && start == -1) start = j;
        if (start != -1 && (!bit || j == CLUSTER_SLOTS - 1)) {
            if (bit && j == CLUSTER_SLOTS - 1) j++;
            *begin = start;
            *end = j - 1;
            return 1;
        }
    }
    return 0;
}
    int start = 0, end = 0;
    clusterNode *n = server.cluster->slots[start];
    while (nextSlotRange(n, &start, &end)) {
        if (n != NULL) {
            sds name = sdsnewlen(n->name, CLUSTER_NAMELEN);
            dictEntry *de = dictFind(nodeToSlotPair, name);
            sdsfree(name);
            list *slotPair;
            if (de == NULL) {
                slotPair = listCreate();
                dictAdd(nodeToSlotPair, sdsnewlen(n->name, CLUSTER_NAMELEN), slotPair);
            } else {
                slotPair = dictGetVal(de);
            }
            if (start != end) {
                listAddNodeTail(slotPair, sdscatfmt(sdsempty(), "%i-%i", start, end));
            } else {
                listAddNodeTail(slotPair, sdscatfmt(sdsempty(), "%i", start));
            }
        }
        if ((end + 1) != CLUSTER_SLOTS) {
            n = server.cluster->slots[end + 1];
        }
    }

Contributor

@zuiderkwast zuiderkwast Feb 14, 2022

@hpatro if n == NULL, the function returns a range with a single slot for no node? I think it is strange. It relies on the caller to make sure that begin + 1 has no node. Maybe you meant *begin = *end + 1? I think the next range starts where the previous range ends. Or did you change this?

Anyway, this function cannot check that the slot has no node, so I think it is strange. Better to check if n == NULL and find a node with a slot before calling the iterator function.

Maybe we need another function nextSlotRangeAndNode(*begin, *end, *node) which also returns the node of the next range...? This logic is different and it is less complex to make them separate functions IMO.

    int start = 0, end = 0;
    clusterNode *n;
    while (nextSlotRangeAndNode(&start, &end, &n)) {
        if (n == NULL) continue;
        // use node and range
    }

Implementation (untested):

/* Finds the next slot range and its node, starting at end + 1. The returned node
 * is NULL if the slot range is unassigned. */
int nextSlotRangeAndNode(int *begin, int *end, clusterNode **node) {
    int start = (*end == 0) ? 0 : *end + 1;
    if (start >= CLUSTER_SLOTS)
        return 0;
    clusterNode *n = server.cluster->slots[start];
    int j = start;
    while (j < CLUSTER_SLOTS && server.cluster->slots[j] == n)
        j++;
    *begin = start;
    *end = j - 1;
    *node = n;
    return 1;
}

One question: Why do we need to return the slots in ascending order? Can't we just use the order of the master nodes?

Contributor Author

For now, I'm not picking this up. I have shared the code between CLUSTER NODES and CLUSTER SHARDS. Maybe we can come back later to reduce the redundancy with CLUSTER SLOTS.

Contributor

Probably fine. @zuiderkwast, maybe make an issue for the improvement?

@zuiderkwast
Contributor

I think the slot ranges should be represented as two integers (as @madolson suggested in the issue) rather than a string of the form "start-end". This way we use RESP for all structure and no extra parsing of a string is needed.

@hpatro
Contributor Author

hpatro commented Feb 13, 2022

I think the slot ranges should be represented as two integers (as @madolson suggested in the issue) rather than a string of the form "start-end". This way we use RESP for all structure and no extra parsing of a string is needed.

@zuiderkwast Actually Madelyn and I discussed this offline and thought that with an extremely fragmented slot distribution, this would be cheaper for network out and require minimal parsing by the client.

@PingXie
Contributor

PingXie commented Feb 13, 2022

I think the slot ranges should be represented as two integers (as @madolson suggested in the issue) rather than a string of the form "start-end". This way we use RESP for all structure and no extra parsing of a string is needed.

@zuiderkwast Actually Madelyn and I discussed this offline and thought that with an extremely fragmented slot distribution, this would be cheaper for network out and require minimal parsing by the client.

I think @zuiderkwast has a good point. Dealing with exceptions always adds complexity.

Based on my quick back-of-the-envelope calculation, I think we could achieve an overall similar footprint (if not better) using the original proposal with one trick, which is to encode the length of the slot range as opposed to the end slot number. Here is an example:

  1. Encode a slot range of [1000, 1000] using the special string with "-"

$9\r\n1000-1000\r\n

The total length is 15 bytes.

  2. Encode the same range using the original proposal but with the range length

*2\r\n:1000\r\n:1\r\n

The total length is also 15 bytes.

Note that a 4-digit slot number is where the two encoding schemes result in the same footprint. When the slot number is greater than 9999, the second encoding scheme yields a smaller footprint while, when the slot number is smaller than 1000, the first encoding scheme yields a smaller footprint. Therefore, in the most extreme case when no two slots are next to each other in the same shard, the second encoding scheme would actually yield an overall smaller footprint while remaining friendly to generic RESP parsing.
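
(To make the break-even point concrete, here is the arithmetic spelled out; the numbers are mine, not from the thread. For a 5-digit single-slot range such as 10000:

$11\r\n10000-10000\r\n    18 bytes (string form)
*2\r\n:10000\r\n:1\r\n    16 bytes (start + length form)

while for a 3-digit slot such as 100, the string form $7\r\n100-100\r\n (13 bytes) beats *2\r\n:100\r\n:1\r\n (14 bytes), consistent with the 4-digit crossover described above.)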

@madolson
Contributor

madolson commented Feb 14, 2022

I like @PingXie's idea. It also has the nice property of making it easier to understand whether or not the boundary is inclusive.

Edit: After looking at the code some more, it seems inconsistent to have different mechanisms return values in different formats. The intention is that this is a "replacement" of sorts for cluster nodes, which actually gives slots in the form "a-b" or "a", so I think it's actually reasonable to re-use the code that generates the nodes output.

@PingXie
Contributor

PingXie commented Feb 14, 2022

The intention is that this is a "replacement" of sorts for cluster nodes, which actually gives slots in the form "a-b" or "a", so I think it's actually reasonable to re-use the code that generates the nodes output.

@madolson did you actually mean a replacement for CLUSTER SLOTS? I thought one of the issues called out in #10168 was about the CLUSTER NODES output not being RESP compliant. Also, the output format implemented here is modeled after CLUSTER SLOTS more than CLUSTER NODES.

That said, the use of "length" instead of the "end slot number" does break away from the CLUSTER SLOTS norm. We are probably looking at an additional ~55KB (6.4K*4 + 9K*3 + .9K*2) overhead in the most fragmented case if we were to go with full RESP compatibility. Assuming each node takes up 300B in the output, a 100-node cluster, and 15B per slot range with the "-" representation, the total output size is about 300B*100 + 15B*16K = 270KB. 55KB is about a 20% increase, but again, this is an extreme case. So here is where I see the tough call:

  1. how likely would one run into this extreme case? It is probably on the long tail, if I had to guess
  2. if something has to give, would it be
    a. breaking away from the norm established by CLUSTER SLOTS and using "slot range length"; or
    b. using a non-RESP compliant representation (the "-"), which by the way also diverges from the CLUSTER SLOTS norm, though in a different way

@hpatro
Contributor Author

hpatro commented Feb 14, 2022

@PingXie

In case a node owns a single slot, the network output would be something like this for the dash - representation:
$4\r\n1000\r\n instead of $9\r\n1000-1000\r\n. The dash - comes into the picture once a node owns more than one contiguous slot.

@zuiderkwast
Contributor

I think the difference of 5-10 bytes per slot range is not very important, even if the difference is 50K.

The real difference is when compared to CLUSTER SLOTS, where the complete node info is repeated for each slot range. Assuming a maximally fragmented cluster with one master and two replicas per shard, and 100B per node info, CLUSTER SLOTS is 3*100*16K = 4.8M.

So, even if CLUSTER SHARDS is 100K, we save 98% compared to CLUSTER SLOTS.

Among the options we have:

   | range           | single slot     |
 1 | "1000-1340"     | "1000"          | Yossi's suggestion
 2 | 1) 1000 2) 1034 | 1) 1000 2) 1000 | Madelyn's original suggestion
 3 | 1) 1000 2) 35   | 1) 1000 2) 1    | Ping's suggestion: start and length

I think 1 and 2 are acceptable, but 3 is rather obscure.
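
(For reference, roughly how options 1 and 2 would appear on the RESP wire for the range 1000-1340, with the framing written out by hand:)

Option 1 (string):        $9\r\n1000-1340\r\n
Option 2 (two integers):  *2\r\n:1000\r\n:1340\r\n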

src/cluster.c Outdated
int start = -1;
void *shard_replylen = addReplyDeferredLen(c);
int shard_cnt = 0;
dict *nodeToSlotPair = dictCreate(&clusterNodesToSlotDictType);
Contributor

Creating a temporary dict seems unnecessary. Maybe we can avoid it?

I think it's cheaper to iterate over the structures and add replies directly without allocating any extra structures, even if we might need to iterate over the nodes one extra time, e.g. once just to count the shards for addReplyArrayLen(c, numshards);.

If we need to create a structure like this one, then perhaps it can be stored permanently or maybe we can change the existing structures to be more optimized for CLUSTER SHARDS (if possible).
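
(A minimal sketch of the counting pass suggested above, assuming each master, including a slotless one, anchors exactly one shard; illustrative only, not the code from the PR:)

/* First pass: count shards so the array length can be emitted up front,
 * avoiding both the deferred length and the temporary dict. */
dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
dictEntry *de;
int numshards = 0;
while ((de = dictNext(di)) != NULL) {
    clusterNode *n = dictGetVal(de);
    if (nodeIsMaster(n)) numshards++; /* each master anchors one shard */
}
dictReleaseIterator(di);
addReplyArrayLen(c, numshards);
/* Second pass (not shown): emit the "slots" and "nodes" entries per shard. */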

Contributor

I think we could re-use the iterator logic for CLUSTER NODES that loops through all the slots and caches the response on each node. We can then re-iterate through the cached nodes to add the output.

@PingXie
Contributor

PingXie commented Feb 14, 2022

I think the difference of 5-10 bytes per slot range is not very important, even if the difference is 50K.

+1

I think 1 and 2 are acceptable, but 3 is rather obscure.

Agreed and 2) would be my preference.

@hpatro
Contributor Author

hpatro commented Feb 14, 2022

I think the difference of 5-10 bytes per slot range is not very important, even if the difference is 50K.

+1

I think 1 and 2 are acceptable, but 3 is rather obscure.

Agreed and 2) would be my preference.

@zuiderkwast / @madolson Do you guys have any preference?

@madolson
Contributor

madolson commented Feb 15, 2022

I'm fine with 1 or 2. I have a slight preference towards option 2 because it's the simplest for clients to implement.

@PingXie I did mean cluster nodes. Anyone that didn't like the cluster slots output switched to using CLUSTER NODES, so we were trying to provide a better output to get them back.

@madolson madolson linked an issue Feb 15, 2022 that may be closed by this pull request
@zuiderkwast zuiderkwast added the state:needs-doc-pr requires a PR to redis-doc repository label Feb 15, 2022
@hpatro hpatro marked this pull request as ready for review February 16, 2022 18:46
Collaborator

@yossigo yossigo left a comment

I haven't looked at the details yet, will do shortly, but I want to point out a potential ambiguity using the term "shard".

It has not been used by Redis Cluster until now, and in other contexts it has been used either as a synonym for "cluster node" or (as used in this PR) to describe a master and set of replicas sharing the same slots. To avoid confusion, I think we should avoid using it here.

@hpatro
Contributor Author

hpatro commented Feb 17, 2022

I haven't looked at the details yet, will do shortly, but I want to point out a potential ambiguity using the term "shard".

It has not been used by Redis Cluster until now, and in other contexts it has been used either as a synonym for "cluster node" or (as used in this PR) to describe a master and set of replicas sharing the same slots. To avoid confusion, I think we should avoid using it here.

@yossigo With the new pubsub feature we decided to use "Sharded PubSub" as the terminology to describe a master and set of replicas sharing the same slots. Shall we continue using the same definition here?

@yossigo
Collaborator

yossigo commented Feb 17, 2022

@hpatro I think "sharded" and "sharding" is not a problem, my only concern is around the current ambiguity around "shard" and whether it refers to a single cluster node or a set of nodes with the same hash slots (master + replicas).

@zuiderkwast
Contributor

@yossigo I think the term shard in this sense is very convenient. The result of this command is an array of info per shard. Do you prefer TOPOLOGY, or do you have another suggestion? Then we can try to compare pros and cons...

Another possible term is "slice". The term "network slicing" is used in telecom with a similar meaning.

@PingXie
Contributor

PingXie commented Feb 17, 2022

The notion of "shard" was probably unofficially introduced already based on the documentation:

Sharded pubsub helps to scale the usage of pubsub in cluster mode. It restricts the propagation of message to be within the shard of a cluster.

I feel that it is very natural to go ahead and make it official. Besides, "shard" is a well-established term in distributed systems, and I don't see the use of "shard" in this context diverging from the norm. The other (less) popular term that I have seen is "partition", but that is quite foreign to Redis, to the best of my knowledge. I have not seen "slice" used in distributed storage/database systems.

@PingXie
Contributor

PingXie commented Feb 18, 2022

While we are here, can we also consider having the node's epoch included in the output as well? The client application can bounce between nodes at run time, so there are times it would receive conflicting "shard" topology. An embedded epoch value would go a long way toward helping the client application resolve the collisions. @zuiderkwast this is somewhat related to our discussion on your topology change pubsub proposal #10150.

@madolson
Contributor

Sorry for taking so long to weigh in, but I also wanted to voice that I think it's good to make the term "shard" more official. The only downside I can think of is that it will become more complex if we ever want to introduce more flexible configurations. For example, consider the scenario where we want to allow masters to be replicas for other masters, so that on node failure we have the data replicated but all nodes are able to take writes.

@PingXie
Contributor

PingXie commented Mar 9, 2022

We touched briefly on epoch, and no one had a strong opinion about it but me. We diverged briefly into a conversation about how we want to make config generations include all of the state information, so I think we settled on adding it in a future version once we have that. This would be independent of epoch though, so we don't want to proactively add it.

@madolson trying to get a better understanding of "make config generations include all of the state information". Are you talking about adding the remaining states like ping-sent, pong-recv, link-state, etc. along with config-epoch in an all-or-none way? Or is this something else?

I don't know that we can defend against people matching against something incorrectly; they should be doing exact matches and not partial matches.

Sure. I don't have a strong opinion here and I haven't seen updates to #10195 either. Agreed we can revisit this later when there is a real use case.

So, when moving to the CLUSTER SHARDS command, failed nodes should be ignored when building the cluster mapping, right?

Forgot to mention that I think we'd also need my changes (#10163) to ensure that failed nodes are included after a replica takes over the primary-ship in the shard. I can port it over after this PR is merged.

@madolson
Contributor

@madolson trying to get a better understanding of "make config generations include all of the state information". Are you talking about adding the remaining states like ping-sent, pong-recv, link-state, etc. along with config-epoch in an all-or-none way? Or is this something else?

@PingXie Include other metadata such as which replicas are serving data from which masters, updates for announced hostnames and IP addresses, any other future config state changes.

@madolson
Contributor

Added documentation: redis/redis-doc#1815

@PingXie
Contributor

PingXie commented Mar 14, 2022

We touched briefly on epoch, and no one had a strong opinion about it but me. We diverged briefly into a conversation about how we want to make config generations include all of the state information, so I think we settled on adding it in a future version once we have that. This would be independent of epoch though, so we don't want to proactively add it.

@PingXie Include other metadata such as which replicas are serving data from which masters, updates for announced hostnames and IP addresses, any other future config state changes.

Thanks @madolson. I would like to clarify that there is a legit and real scenario for having epochs in CLUSTER SHARDS, as it further reduces the likelihood of getting stale topology. This information is currently available via CLUSTER NODES, whose replacement (#10168) is also one of the reasons why the new CLUSTER SHARDS command is introduced. With the epoch, applications would not need to invoke both CLUSTER SHARDS and CLUSTER NODES any more. Happy to add the epoch along with the change to preserve failed nodes next.

Member

@oranagra oranagra left a comment

Approved in a core-team meeting. Please merge away (and make sure the docs are updated too).

@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Mar 15, 2022
@madolson madolson merged commit 45ccae8 into redis:unstable Mar 16, 2022

set node_0_id [R 0 CLUSTER MYID]

test "Kill a node and tell the replica to immediately takeover" {
Contributor

This fails in my GH cluster test
https://github.com/enjoy-binbin/redis/runs/5563545978?check_suite_focus=true#step:9:681

02:18:22> Kill a node and tell the replica to immediately takeover: FAILED: caught an error in the test Port 30000 does not return available after killing instance.
Port 30000 does not return available after killing instance.

Contributor

Will take a look here.

Contributor

Bane of my existence, for some reason this test only fails when you run the entire suite but passes individually. I'm committing to deprecating this test suite now.

int tls_port = server.cluster_announce_tls_port ? server.cluster_announce_tls_port : server.tls_port;
if (tls_port) {
addReplyBulkCString(c, "tls-port");
addReplyLongLong(c, tls_port);
Contributor

@enjoy-binbin enjoy-binbin Mar 17, 2022

Please take a look here (tls_port and the above port in TLS mode); the cluster-shards test failed in TLS (https://github.com/redis/redis/runs/5578694306?check_suite_focus=true#step:9:682).

With this patch the test passes, but I'm not very familiar with TLS mode:

    int port = server.cluster_announce_port ? server.cluster_announce_port : server.port;
    int tls_port = server.cluster_announce_tls_port ? server.cluster_announce_tls_port : server.tls_port;
    if (port) {
        addReplyBulkCString(c, "port");
        if (tls_port)
            addReplyLongLong(c, node->pport);
        else
            addReplyLongLong(c, node->port);
        reply_count++;
    }

    if (tls_port) {
        addReplyBulkCString(c, "tls-port");
        addReplyLongLong(c, node->port);
        reply_count++;
    }

In case anyone is going to fix it, I would like to mention:

  1. since we deprecated cluster-slots, maybe we need to add a doc_flags in cluster-slots.json like the others
  2. I noticed cluster-shards.json is indented by two spaces; we use four spaces elsewhere
  3. the three wait_for_condition 50 100 in 28-cluster-shards.tcl may have a timing issue; at least I've encountered it locally (centos7), so maybe change them to 500 100

Contributor

For 1, we marked it as deprecated. Was there something else that you meant?

Contributor

I mean the doc_flags, like other commands?

        "deprecated_since": "6.2.0",
        "replaced_by": "`GEOSEARCH` and `GEOSEARCHSTORE` with the `BYRADIUS` argument",
        "doc_flags": [
            "DEPRECATED"
        ],

Contributor

Oh, I guess I didn't realize that was a thing.

Contributor

@hpatro Do you have time to address this? Otherwise I can take a look tomorrow.

Contributor

@enjoy-binbin Regarding the port and tls-port logic: server.tls_port is always a TLS port and server.port is always a plaintext (AKA cleartext) port if specified. For the cluster bus, it depends on the server.tls_cluster variable:

if (server.tls_cluster) {
    // node->port is TLS port (always set)
    // node->pport is plaintext port, optional (if non-zero)
} else {
    // node->port is plaintext port (always set)
    // There is no TLS port
}
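
(Given those rules, a hedged sketch of what the reply logic could look like; not necessarily the fix that was eventually committed:)

if (server.tls_cluster) {
    addReplyBulkCString(c, "tls-port");
    addReplyLongLong(c, node->port);      /* TLS port, always set */
    if (node->pport) {
        addReplyBulkCString(c, "port");
        addReplyLongLong(c, node->pport); /* plaintext port, optional */
    }
} else {
    addReplyBulkCString(c, "port");
    addReplyLongLong(c, node->port);      /* plaintext port, always set */
}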

oranagra added a commit that referenced this pull request Mar 28, 2022
There are a few places that use a hard-coded constant of 128 to allocate a buffer for d2string.
Replace these with a clear macro.
Note that in theory, converting a double into a string could take as much as nearly 400 chars,
but since d2string uses `%g` and not `%f`, it won't exceed some 40 chars.

unrelated:
restore some changes to the auto-generated commands.c that got accidentally reverted in #10293
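
(A hedged illustration of the described change; the macro name MAX_D2STRING_CHARS is an assumption here, not confirmed from the commit:)

/* Illustrative only: a named constant replaces scattered uses of 128.
 * d2string uses %g, so ~40 chars is the practical worst case and 128
 * leaves ample headroom. */
#define MAX_D2STRING_CHARS 128

char buf[MAX_D2STRING_CHARS];
int len = d2string(buf, sizeof(buf), 3.141592653589793);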
oranagra added a commit that referenced this pull request Mar 29, 2022
This commit improves the malloc efficiency of the slots_info_pairs mechanism in cluster.c
by changing the adlist into an array that is realloced with a greedy growth mechanism.

Recently the cluster tests have been consistently failing when executed with ASAN in the CI.
I tried to track down the commit that started it, and it appears to be #10293.
Looking at the commit, I realized it didn't affect this test / flow, other than the
replacement of the slots_info_pairs from sds to list.

I concluded that what could be happening is that the slot range is very fragmented,
and that results in many allocations.
With sds, it results in one allocation, and we also have a greedy growth mechanism,
but with adlist, we just have many, many small allocations.
This probably causes stress on ASAN and causes it to be slow at termination.
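
(A minimal sketch of the greedy-growth idea; the names are illustrative, not the committed code:)

/* Flat array of slot numbers stored as (start, end) pairs. Capacity doubles
 * when full, so n appends trigger only O(log n) reallocations instead of the
 * n small node allocations an adlist would make. Uses Redis's zrealloc. */
typedef struct {
    uint16_t *slots;  /* flattened (start, end) pairs */
    int count;        /* used entries */
    int capacity;     /* allocated entries */
} slotsPairsBuf;

static void slotsPairsAppend(slotsPairsBuf *buf, uint16_t slot) {
    if (buf->count == buf->capacity) {
        buf->capacity = buf->capacity ? buf->capacity * 2 : 8;
        buf->slots = zrealloc(buf->slots, buf->capacity * sizeof(uint16_t));
    }
    buf->slots[buf->count++] = slot;
}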
@oranagra oranagra mentioned this pull request Apr 5, 2022
Comment on lines +5045 to +5048
while((ln = listNext(&li))) {
addReplyBulkLongLong(c, (unsigned long)listNodeValue(ln));
ln = listNext(&li);
addReplyBulkLongLong(c, (unsigned long)listNodeValue(ln));
Contributor

@enjoy-binbin enjoy-binbin May 5, 2022

the slots here are not integers, see #10680

This is inconsistent with the top comment and the CLUSTER SHARDS docs:

127.0.0.1:5001> cluster shards
1) 1) "slots"
   2) 1) "0"
      2) "16383"

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request May 5, 2022
It used to return slots as strings, like:
```
redis> cluster shards
1) 1) "slots"
   2) 1) "10923"
      2) "16383"
```

CLUSTER SHARDS docs and the top comment of redis#10293 say that it returns integers.
Note that other commands like CLUSTER SLOTS return slots as integers.
Use addReplyLongLong instead of addReplyBulkLongLong; now it returns slots as integers:
```
redis> cluster shards
1) 1) "slots"
   2) 1) (integer) 10923
      2) (integer) 16383
```

This is a small breaking change, introduced in 7.0.0 (7.0 RC3, redis#10293)

Fixes redis#10680
oranagra pushed a commit that referenced this pull request May 10, 2022
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…0683)


Labels

  • approval-needed: Waiting for core team approval to be merged
  • release-notes: indication that this issue needs to be mentioned in the release notes
  • state:major-decision: Requires core team consensus
  • state:needs-doc-pr: requires a PR to redis-doc repository

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

New Redis Cluster topology command

8 participants