Improve slot migration reliability#10517
Conversation
|
Right now slot migration is entirely administrator driven, where they execute a lot of commands and handles reconciliation of failures. This PR definitely does a lot to make this more "managed", however I still believe the fundamental gap is that slot migration should just be exposed as an atomic operation to admins. Replicating "setslot" also introduces the awkward situation where there are now two competing sources of truth for slot ownership, the first from the clusterbus and the second from the replication stream. It might make some sense to have a shard own its slot, and then passively observe the state changes from other shards. |
The "atomicity" (for lack of a better term) is a nice thing to have but I am not sure if it is truly the core problem or it could be dealt with separately. In other words, I wonder if there is a way to solve the two key issues in slot migration (no replication of migration states and the fragility in the slot finalization logic) without changing the current slot migration protocol.
I am guessing "setslot node" is the one causing some uneasiness and I agree "setslot node" is not strictly required. "setslot migrating" and "setslot importing" on the other hand should be fine. I added you to the main thread for #6339. I will also think some more about your proposal on #2807. |
1. Added a new "shard_id" field to "cluster nodes" output and nodes.conf after "hostname" 2. Added a new entry "shard_id" to "cluster shards" at the beginning of every shard 3. Added a new PING extension to propagate "shard_id" 4. Handled upgrade from pre-7.0 releases automatically 5. Refactored PING extension assembling/parsing logic
|
@zuiderkwast @madolson I updated the PR to enforce the ordering of
Where This PR is now complete and it handles the two core issues as we discussed earlier (no replication of migration states and the fragility in the slot finalization logic). This PR should be compatible with the existing slot migration protocol. Thoughts? |
|
I like this change and I think it can be merged regardless of the other plans for CLUSTER MIGRATE, because it improves the current migration process, which we can't deprecate any time soon anyway. The change is backwards compatible with old client applications and old Redis nodes. Old replicas would simply error (ignore) the replicated SETSLOT. What it doesn't solve is a migration interrupted by a failover. The migration doesn't continue automatically. It can be handled by The problem is that there's still manual administration required in this case. If we can do anything to automate this scenario? For example, can we make |
I think both |
|
Just noticed: When |
@zuiderkwast it looks to me that replicas would panic on WRITE commands only (if there was a disk error). I turned on PS, now that 7.0 has GA'd, I will get my shard ID change in shape next. There are a few places in this PR where I need to detect the same-shard relationship and I think they could benefit from the shard ID change. |
Change-Id: Id2acfe6e9d11351411af7be2ec45c25b79df3ecd
…redis into oss_shard_id Change-Id: I722918bcd0b12c13139e4183916e59f8f63f40db
1. Replicate the "CLUSTER SETSLOT" command
2. Update importing_slots_from and migrating_slots_to when processing slots config updates
3. Replicate open slots when performing full sync
4. Auto-clear migrating_slots_from for empty slots when they are claimed by other primaries
5. Add a new block type BLOCKED_WAIT_RERUN and a new client flag CLIENT_RERUN_COMMAND to support the new replication-then-execution use case
6. Order "setslot node" steps on destination shard
1). Client issues "setslot node" against destination primary
2). Destination primary replicates "setslot node" to all replicas
3). On successful replication, destination primary executes "setslot node"
4). Destination primary completes "setslot node"
5). The following two steps can happen in parallel
a. New slot ownership is gossiped from destination to source
b. Client issues "setslot node" on source primary
Change-Id: Ib68bcfd153d7f755b8ad84979cc208e86a2aac9f
|
FYI @yossigo @oranagra - with this change both 20* and 21* cluster migration tests pass consistently without further modifications. Also I believe there is a way to extend this solution to achieve a forkless solution for #2807 as well though I haven't started yet. Curious to hear your thoughts on this approach. Let me know if there is anything I can help to make some progress on this PR. |
|
I'd like to see these incremental improvements to slot migration in 7.2, since I don't expect atomic slot migration to be ready soon enough. @redis/core-team WDYT? |
|
I agree about getting these improvements in 7.2, we don't really have much of a concrete plan for when 7.2 is going to launch. |
|
"it is already broken for many years (indicating that it's not urgent)" and I believe we will drop it later (IMHO, it's very strange to replicate an ADMIN command). So, I suggest we consider using the cluster bus to implement it, if we don't have enough time, we can put it into 8.0. |
|
We suffer from these problems when scaling, when many slots are moved. We have seen slots that get two or zero owners. And there is potential data loss. The long term solution would be either Atomic Slot Migration or the Cluster V2 "flotilla". (Is there any decision about that?) In the mean time, I though this PR was supposed to be a quick backwards compatible fix, yet it has been reviewed since before 7.0 was released.
How can reliability issues like this be not urgent? AWS has their own slot migration implementation. Why? Can it be that open source slot migration has never really been reliable... |
The cynic in me is that most people are using managed providers that either do cluster different or wrote around it (like AWS as you mentioned). You're right though, this should be a higher priority than it is.
I think there is a good debate here, and I have mixed feelings as well. I think it could be reduced into two different ways of thinking:
I think ultimately one is the more correct way to think about the data transfer, which may be divergent from the way flotilla is thinking about it and how we do it today. It's also divergent from zhao's "So, I suggest we consider using the cluster bus to implement it". I don't agree with that.
We discussed this and I agreed with the approach you are suggesting originally, but Ping made the case that it was a lot more complexity.
I just want to say I don't think this is all that fair of an assessment. I still believe there are structural issues with the way slot migration is broken today, and we need more effort to fix it. There are other issues as well such as the epoch after bump issue. |
|
Of course, I support fixing this problem. However, but the key point is not whether it is urgent, but whether the method of fixing it is correct. As far as I am concerned, using the replication stream method is wrong. @PingXie Have you tried the method that I suggested in #10517 (comment)? IIUC, the discussion before was mainly about using the cluster bus to broadcast. My suggestion is to spread it only to the affected replica, just like how the replication affects the nodes. |
I'm not convinced about replication being the wrong approach, and I'd like to understand your concerns regarding the challenges with "multiplexing" that you mentioned earlier. It would be really helpful if you can expand on it some more or maybe share some pointers. FWIW, we currently have a few commands marked for replication. While they may not be categorized as ADMIN commands, the argument of data Vs non-data can be quite blurry (for example, I think it's worth diving deeper into this discussion of replication vs broadcasting, data vs non-data, etc here on GitHub, even if it's more philosophical in nature. The knowledge sharing would be extremely valuable for the community as well as for me personally.
I think the idea of sharded broadcast is quite interesting (thanks for the suggestion!). I'm definitely open to giving it a shot and seeing how it plays out. However, I don't think it is a guaranteed solution. Until I spend some quality time exploring the idea, I can't be sure what other issues might arise. Note that the current PR has already gone through multiple peer reviews and lots of testing, including our internal testing. If we consider getting a good grasp of this idea as a prerequisite for making a decision on whether to fix the reliability issue in 7.2, then the answer seems pretty clear to me: we won't be fixing it in 7.2. So, here's an alternative approach I'd like to suggest. How about we focus on addressing the remaining feedback on the existing implementation and merge it into 7.2? At the same time, we can start evaluating the sharded broadcast idea for the future versions. I do have one concern, though. The current design, which relies on replication, gives the caller (redis-cli in this case) explicit control over the number of replicas that need to acknowledge the receipt of slot migration states using the |
Indeed, the root cause of all these problems can be traced back to the absence of a consensus-driven epoch bump. Until this fundamental aspect is addressed, I think we are just playing a never-ending game of "whack-a-mole." |
I don't think multiplexing is right here. The point of multiplexing is to support out of band messages, in which case we would have used the cluster-bus as was mentioned. We still can run into issues when a replica fails before receiving a message that it's primary has started importing or migrating a slot. The current implementation relies on the cli to send a wait, but we could have also changed the cli to wait for the replica to acknowledge the new state. I still have a preference for the current replication based implementation. |
|
One more point for having SETSLOT in the replication stream is consistency for clients that are reading from replicas. When a key is migrated, it is deleted from replicas. If the replica knows the slot is being migrated, it can return an ASK-redirect even if DEL comes immediately after SETSLOT MIGRATING. (#11312 is not yet supported but it's easy to implement after this.) The cluster bus is not synchronized with the replication stream. If we use the cluster bus for SETSLOT, the admin would need to wait for all replicas to ack the new state before starting to move keys, if we want replica read consistency. |
|
I still believe cluster bus is the right way. But, maybe we can change our perspective. Why do we have to rely on replication or cluster bus to propagate the metadata? Isn't it simpler and more explicit to execute "cluster setslot with master node id" on the replica nodes directly? After all, "REPLICATE" also needs to be explicitly specified by the administrator. For the old version of the cluster manager, this is a breaking change. It's better to let the administrator be responsible for making sure that the replica nodes are aware of the migration status. |
Although your proposal is simpler, it is adding failure points into the system and also requires all existing systems to adopt the new API calls. I don't think we should be pushing too much onto administrators, especially with flotilla wanting to be more of a stateless controller. I think the main decision point is this though, how do we want the migrating and importing state to be recognized on the replica. Although we haven't implemented it yet, I also want to discuss how it may relate to Cluster V2 (flotilla) as well as an atomic slot migration. Option 1: Have the slot state be transferred through the replication state. Ideal state is a RDB full sync includes slot information + replicating set slot commands.
Option 2: Have the slot state be transferred through the cluster state
|
I haven't had a chance to evaluate the cluster bus solution in depth, but one thing I'm pretty certain about is that we'll need to establish a new mechanism parallel to 'WAIT'. This will be essential to attain parity with the replication solution, which enables the caller to explicitly synchronize with replicas regarding the slot migration states. That's why I still think that this issue fundamentally revolves around replication.
Philosophically speaking, I beg to differ with this statement. If we were to adopt the approach of "admin handling everything," I believe there would be minimal need to address any reliability concerns since, ultimately, the admin would be responsible. IMO, the main objective here is to reduce human intervention, or the involvement of the control plane to a greater extent, allowing the Redis service (managed or not) to have an opportunity to achieve 3/4/5 9s of SLOs by itself. |
Seems you misunderstood me, my point is that the administrator is responsible for all metadata, such as passwords, common configuration items such as "appendonly", and cluster configuration items such as "cluster-allow-reads-when-down". It is the responsibility of the administrator to ensure that these are set correctly and consistently on both the master and replicas. In my opinion, slot information is also metadata or a configuration item, and it is the right choice for the administrator to ensure consistency between the master and replicas. Even when using the "WAIT" command, it is not transparent to the administrator, and the administrator is still responsible for making the final consistent judgment. |
|
Requiring manual intervention is quite bad IMO. It's too easy to mess up. A cluster does guarantee some things by itself, like making sure there is exactly one master per shard. Guaranteeing slot ownership consistency is one such thing it should do IMO. If some configuration (or ACL, etc.) can be inconsistent between different nodes, that's a design flaw IMO and we better fix that; propagate that config to make sure all nodes have consistent configuration. That's a different discussion though. |
|
Flotilla orchestrates all slot changes through the Topology Director (the admin never needs to communicate directly with the data nodes), and guarantees that a newly elected primary node is at least as up to date on slot ownership than any of its replicas. There's a potential for stale reads when read-from-replica is enabled -- but that's inherent in any read from replica usecase.... |
I strongly agree with this approach, and I also want to point out that in this case, the Topology Director is the real administrator, as it is doing the job of an administrator. |
As far as I know, in areas outside of Redis, there are many systems that can achieve consistent reads on replica databases, such as Polardb. However, this is another interesting topic. |
That's very bad analogy IMO. The Topology Director is a raft cluster, which achieves consensus among shards. The consistency between master and replicas is handled by Failover Coordinator, which is another raft cluster per shard. These are part of the cluster, not admins. An administrator is a user, typically a human, orchestrating the whole thing. An admin should not be able to induce inconsistent slot ownerships by sending SETSLOT differently to different nodes. |
|
I don't think so, an administrator is not a specific person, it is a role, as you said, responsible for resource orchestration, metadata management, and other tasks. |
|
@soloestoy Sure -- I meant in the Redis replica sense.... Not as a global statement for all dist systems :) |
|
Hi @soloestoy and team, revisiting our conversation on this PR, I've been considering the cluster bus method. My preference for the replication approach primarily stems from how the |
|
I'm still waiting for a merge. :) It's a good improvement. A bonus point for making replicas aware of migrations: It opens for replicas to return -ASK redirects. Currently, they don't know about ongoing migration so they just return null for already migrated keys. |
|
Hey team, can we bring this PR to the table at your next core team meeting? It feels like the ideal time to reach a decision and keep things moving. I'm ready to walk you through my thought process and am keen to hear your perspectives. |
|
all tests have passed. This PR is ready for merge. |
Fix Issues with slot migration #6339
CLUSTER SETSLOT,REPLICATEwhich indicates to the primary that it should also replicate the migration state to its replicas if it is successful. This is used so that replicas will be aware of ongoing migrations so that they can serve ASKING requests as well as continue after a failover.Cases which are not covered:
This doesn't conflict with future optimizations we want to make for flotilla, which is more about offloading the work of transferring the data away from the cli to the nodes in the cluster.