Skip to content

Improve slot migration reliability#10517

Closed
PingXie wants to merge 50 commits into
redis:unstablefrom
PingXie:slot_migration2
Closed

Improve slot migration reliability#10517
PingXie wants to merge 50 commits into
redis:unstablefrom
PingXie:slot_migration2

Conversation

@PingXie

@PingXie PingXie commented Apr 4, 2022

Copy link
Copy Markdown
Contributor

Fix Issues with slot migration #6339

  1. Add a new argument to the CLUSTER SETSLOT, REPLICATE which indicates to the primary that it should also replicate the migration state to its replicas if it is successful. This is used so that replicas will be aware of ongoing migrations so that they can serve ASKING requests as well as continue after a failover.
  2. Update importing_slots_from and migrating_slots_to when processing slots config updates
  3. Replicate open slots when performing full sync. This is in addition to 1.
  4. Auto-clear migrating_slots_from for empty slots when they are claimed by other primaries

Cases which are not covered:

  1. AOF and RDBs are not covered. Failures where the master is restarted and starts back up will rely on the nodes.conf file to have the migration state.

This doesn't conflict with future optimizations we want to make for flotilla, which is more about offloading the work of transferring the data away from the cli to the nodes in the cluster.

@madolson

madolson commented Apr 4, 2022

Copy link
Copy Markdown
Contributor

Right now slot migration is entirely administrator driven, where they execute a lot of commands and handles reconciliation of failures. This PR definitely does a lot to make this more "managed", however I still believe the fundamental gap is that slot migration should just be exposed as an atomic operation to admins.

Replicating "setslot" also introduces the awkward situation where there are now two competing sources of truth for slot ownership, the first from the clusterbus and the second from the replication stream. It might make some sense to have a shard own its slot, and then passively observe the state changes from other shards.

@PingXie

PingXie commented Apr 5, 2022

Copy link
Copy Markdown
Contributor Author

Right now slot migration is entirely administrator driven, where they execute a lot of commands and handles reconciliation of failures. This PR definitely does a lot to make this more "managed", however I still believe the fundamental gap is that slot migration should just be exposed as an atomic operation to admins.

The "atomicity" (for lack of a better term) is a nice thing to have but I am not sure if it is truly the core problem or it could be dealt with separately. In other words, I wonder if there is a way to solve the two key issues in slot migration (no replication of migration states and the fragility in the slot finalization logic) without changing the current slot migration protocol.

Replicating "setslot" also introduces the awkward situation where there are now two competing sources of truth for slot ownership, the first from the clusterbus and the second from the replication stream. It might make some sense to have a shard own its slow, and then passively observe the state changes from other shards.

I am guessing "setslot node" is the one causing some uneasiness and I agree "setslot node" is not strictly required. "setslot migrating" and "setslot importing" on the other hand should be fine.

I added you to the main thread for #6339. I will also think some more about your proposal on #2807.

Ping Xie added 3 commits April 5, 2022 18:59
1. Added a new "shard_id" field to "cluster nodes" output and nodes.conf after "hostname"
2. Added a new entry "shard_id" to "cluster shards" at the beginning of every shard
3. Added a new PING extension to propagate "shard_id"
4. Handled upgrade from pre-7.0 releases automatically
5. Refactored PING extension assembling/parsing logic
@PingXie

PingXie commented Apr 9, 2022

Copy link
Copy Markdown
Contributor Author

@zuiderkwast @madolson I updated the PR to enforce the ordering of SETSLOT NODE as follows

  1. Client C issues SETSLOT n NODE B against node B
  2. Node B replicates SETSLOT n NODE B to all of its replicas, such as B', B'', etc
  3. On replication completion, node B executes SETSLOT n NODE B and returns control back to client C
  4. The following steps can happen in parallel
  • Client C issues SETSLOT n NODE B against node A
  • node B gossips its new slot ownership to the cluster including A, A', etc

Where A is the source primary and B is the destination primary.

This PR is now complete and it handles the two core issues as we discussed earlier (no replication of migration states and the fragility in the slot finalization logic). This PR should be compatible with the existing slot migration protocol. Thoughts?

@zuiderkwast

Copy link
Copy Markdown
Contributor

I like this change and I think it can be merged regardless of the other plans for CLUSTER MIGRATE, because it improves the current migration process, which we can't deprecate any time soon anyway. The change is backwards compatible with old client applications and old Redis nodes. Old replicas would simply error (ignore) the replicated SETSLOT.

What it doesn't solve is a migration interrupted by a failover. The migration doesn't continue automatically. It can be handled by redis-cli --cluster fix which can continue the migration (which it can do even without this PR with some guesswork and using CLUSTER COUNTKEYSINSLOT, but now with less guesswork when there is one importing and one migrating node).

The problem is that there's still manual administration required in this case. If we can do anything to automate this scenario? For example, can we make redis-cli --cluster rebalance detect a failover during migration and automatically continue the migration after a failover?

@PingXie

PingXie commented Apr 13, 2022

Copy link
Copy Markdown
Contributor Author

What it doesn't solve is a migration interrupted by a failover. The migration doesn't continue automatically. It can be handled by redis-cli --cluster fix which can continue the migration (which it can do even without this PR with some guesswork and using CLUSTER COUNTKEYSINSLOT, but now with less guesswork when there is one importing and one migrating node).

The problem is that there's still manual administration required in this case. If we can do anything to automate this scenario? For example, can we make redis-cli --cluster rebalance detect a failover during migration and automatically continue the migration after a failover?

I think both cluster fix and cluster rebalance are good callouts. I can certainly help improve redis-cli after this PR is merged.

@zuiderkwast

Copy link
Copy Markdown
Contributor

Just noticed: When CLUSTER SETSLOT ... MIGRATING|IMPORTING is replicated, until now this was backward compatible, because replicase used to ignore failing commands, but with #10504 included in Redis 7, replicas can (optionally, configurable) panic on replication errors. This means that this feature is not compatible with a replica running Redis 7.0 if it's configured to panic on replication errors.

@PingXie

PingXie commented Apr 28, 2022

Copy link
Copy Markdown
Contributor Author

Just noticed: When CLUSTER SETSLOT ... MIGRATING|IMPORTING is replicated, until now this was backward compatible, because replicase used to ignore failing commands, but with #10504 included in Redis 7, replicas can (optionally, configurable) panic on replication errors. This means that this feature is not compatible with a replica running Redis 7.0 if it's configured to panic on replication errors.

@zuiderkwast it looks to me that replicas would panic on WRITE commands only (if there was a disk error). I turned on MAY_REPLICATE for SETSLOT but left WRITE out so I think the change should still be compatible?

PS, now that 7.0 has GA'd, I will get my shard ID change in shape next. There are a few places in this PR where I need to detect the same-shard relationship and I think they could benefit from the shard ID change.

Ping Xie added 3 commits April 28, 2022 05:43
Change-Id: Id2acfe6e9d11351411af7be2ec45c25b79df3ecd
…redis into oss_shard_id

Change-Id: I722918bcd0b12c13139e4183916e59f8f63f40db
1. Replicate the "CLUSTER SETSLOT" command
2. Update importing_slots_from and migrating_slots_to when processing slots config updates
3. Replicate open slots when performing full sync
4. Auto-clear migrating_slots_from for empty slots when they are claimed by other primaries
5. Add a new block type BLOCKED_WAIT_RERUN and a new client flag CLIENT_RERUN_COMMAND to support the new replication-then-execution use case
6. Order "setslot node" steps on destination shard
    1). Client issues "setslot node" against destination primary
    2). Destination primary replicates "setslot node" to all replicas
    3). On successful replication, destination primary executes "setslot node"
    4). Destination primary completes "setslot node"
    5). The following two steps can happen in parallel
        a. New slot ownership is gossiped from destination to source
        b. Client issues "setslot node" on source primary

Change-Id: Ib68bcfd153d7f755b8ad84979cc208e86a2aac9f
@PingXie

PingXie commented May 1, 2022

Copy link
Copy Markdown
Contributor Author

FYI. The latest push includes shard id changes from #10536. I will rebase once #10536 is merged.

@PingXie

PingXie commented May 2, 2022

Copy link
Copy Markdown
Contributor Author

FYI @yossigo @oranagra - with this change both 20* and 21* cluster migration tests pass consistently without further modifications. Also I believe there is a way to extend this solution to achieve a forkless solution for #2807 as well though I haven't started yet. Curious to hear your thoughts on this approach. Let me know if there is anything I can help to make some progress on this PR.

@zuiderkwast

Copy link
Copy Markdown
Contributor

I'd like to see these incremental improvements to slot migration in 7.2, since I don't expect atomic slot migration to be ready soon enough. @redis/core-team WDYT?

@yossigo yossigo requested a review from madolson September 8, 2022 14:01
@madolson

madolson commented Sep 8, 2022

Copy link
Copy Markdown
Contributor

I agree about getting these improvements in 7.2, we don't really have much of a concrete plan for when 7.2 is going to launch.

@PingXie

PingXie commented Sep 9, 2022

Copy link
Copy Markdown
Contributor Author

Please hold off on the review of this PR. #10536 is the prerequisite to this PR and it has gone through some churns recently. As soon as #10536 gets merged, I will rebase this PR and ping the thread for a review. Stay tuned...

@soloestoy

Copy link
Copy Markdown
Contributor

"it is already broken for many years (indicating that it's not urgent)" and I believe we will drop it later (IMHO, it's very strange to replicate an ADMIN command). So, I suggest we consider using the cluster bus to implement it, if we don't have enough time, we can put it into 8.0.

@zuiderkwast

Copy link
Copy Markdown
Contributor

We suffer from these problems when scaling, when many slots are moved. We have seen slots that get two or zero owners. And there is potential data loss.

The long term solution would be either Atomic Slot Migration or the Cluster V2 "flotilla". (Is there any decision about that?) In the mean time, I though this PR was supposed to be a quick backwards compatible fix, yet it has been reviewed since before 7.0 was released.

although it is already broken for many years (indicating that it's not urgent)

How can reliability issues like this be not urgent?

AWS has their own slot migration implementation. Why? Can it be that open source slot migration has never really been reliable...

@madolson

madolson commented Jun 28, 2023

Copy link
Copy Markdown
Contributor

How can reliability issues like this be not urgent?

The cynic in me is that most people are using managed providers that either do cluster different or wrote around it (like AWS as you mentioned). You're right though, this should be a higher priority than it is.

didn't look deep into the codes, but to be honest, I don't like using the replication stream to propagate MIGRATING/IMPORTING status, this makes the replication stream muddier.

I think there is a good debate here, and I have mixed feelings as well. I think it could be reduced into two different ways of thinking:

  1. Slot ownership is data. Whether or not a slot is currently owned and can be served by a node can and should come from the replication stream. A replica will receive the slot state during load and will receive the state changes in the stream. This prevents subtle divergences which can happen, such as the replica thinking it can serve data that was deleted.
  2. Slot ownership is independent of data. This is more of the current scheme, which is to say the replica will eventually learn what is going on from the primary based off the cluster state.

I think ultimately one is the more correct way to think about the data transfer, which may be divergent from the way flotilla is thinking about it and how we do it today. It's also divergent from zhao's "So, I suggest we consider using the cluster bus to implement it". I don't agree with that.

For newly added replicas, we can transmit the state of all slots when they perform the handshake with the master. Besides, currently, adding CLUSTER SETSLOT command to the replication stream won't enable new replicas to perceive MIGRATING/IMPORTING status since this information is not included in the RDB during full synchronization.

We discussed this and I agreed with the approach you are suggesting originally, but Ping made the case that it was a lot more complexity.

In the mean time, I though this PR was supposed to be a quick backwards compatible fix

I just want to say I don't think this is all that fair of an assessment. I still believe there are structural issues with the way slot migration is broken today, and we need more effort to fix it. There are other issues as well such as the epoch after bump issue.

@soloestoy

Copy link
Copy Markdown
Contributor

Of course, I support fixing this problem. However, but the key point is not whether it is urgent, but whether the method of fixing it is correct. As far as I am concerned, using the replication stream method is wrong.

@PingXie Have you tried the method that I suggested in #10517 (comment)? IIUC, the discussion before was mainly about using the cluster bus to broadcast. My suggestion is to spread it only to the affected replica, just like how the replication affects the nodes.

@PingXie

PingXie commented Jun 29, 2023

Copy link
Copy Markdown
Contributor Author

Of course, I support fixing this problem. However, but the key point is not whether it is urgent, but whether the method of fixing it is correct. As far as I am concerned, using the replication stream method is wrong.

I'm not convinced about replication being the wrong approach, and I'd like to understand your concerns regarding the challenges with "multiplexing" that you mentioned earlier. It would be really helpful if you can expand on it some more or maybe share some pointers.

FWIW, we currently have a few commands marked for replication. While they may not be categorized as ADMIN commands, the argument of data Vs non-data can be quite blurry (for example, evalsha). On the other hand, one could argue that these are exceptional cases, which would be acceptable if we can clearly articulate the differences.

eval.json:13:            "MAY_REPLICATE",
evalsha.json:13:            "MAY_REPLICATE",
fcall.json:13:            "MAY_REPLICATE",
pfcount.json:11:            "MAY_REPLICATE"
publish.json:14:            "MAY_REPLICATE",
spublish.json:14:            "MAY_REPLICATE"

I think it's worth diving deeper into this discussion of replication vs broadcasting, data vs non-data, etc here on GitHub, even if it's more philosophical in nature. The knowledge sharing would be extremely valuable for the community as well as for me personally.

@PingXie Have you tried the method that I suggested in #10517 (comment)? IIUC, the discussion before was mainly about using the cluster bus to broadcast. My suggestion is to spread it only to the affected replica, just like how the replication affects the nodes.

I think the idea of sharded broadcast is quite interesting (thanks for the suggestion!). I'm definitely open to giving it a shot and seeing how it plays out. However, I don't think it is a guaranteed solution. Until I spend some quality time exploring the idea, I can't be sure what other issues might arise. Note that the current PR has already gone through multiple peer reviews and lots of testing, including our internal testing. If we consider getting a good grasp of this idea as a prerequisite for making a decision on whether to fix the reliability issue in 7.2, then the answer seems pretty clear to me: we won't be fixing it in 7.2.

So, here's an alternative approach I'd like to suggest. How about we focus on addressing the remaining feedback on the existing implementation and merge it into 7.2? At the same time, we can start evaluating the sharded broadcast idea for the future versions. I do have one concern, though. The current design, which relies on replication, gives the caller (redis-cli in this case) explicit control over the number of replicas that need to acknowledge the receipt of slot migration states using the WAIT command. This is an important aspect of the user/admin experience that ideally should be maintained. If we believe WAIT is the right approach (orthogonal to the replication vs broadcasting discussion) or we see no concern of dropping it, then I don't foresee any issues with replacing the internal implementation in the future.

@PingXie

PingXie commented Jun 29, 2023

Copy link
Copy Markdown
Contributor Author

I just want to say I don't think this is all that fair of an assessment. I still believe there are structural issues with the way slot migration is broken today, and we need more effort to fix it. There are other issues as well such as the epoch after bump issue.

Indeed, the root cause of all these problems can be traced back to the absence of a consensus-driven epoch bump. Until this fundamental aspect is addressed, I think we are just playing a never-ending game of "whack-a-mole."

@madolson

madolson commented Jul 1, 2023

Copy link
Copy Markdown
Contributor

I'm not convinced about replication being the wrong approach, and I'd like to understand your concerns regarding the challenges with "multiplexing" that you mentioned earlier.

I don't think multiplexing is right here. The point of multiplexing is to support out of band messages, in which case we would have used the cluster-bus as was mentioned. We still can run into issues when a replica fails before receiving a message that it's primary has started importing or migrating a slot. The current implementation relies on the cli to send a wait, but we could have also changed the cli to wait for the replica to acknowledge the new state. I still have a preference for the current replication based implementation.

@zuiderkwast

Copy link
Copy Markdown
Contributor

One more point for having SETSLOT in the replication stream is consistency for clients that are reading from replicas.

When a key is migrated, it is deleted from replicas. If the replica knows the slot is being migrated, it can return an ASK-redirect even if DEL comes immediately after SETSLOT MIGRATING. (#11312 is not yet supported but it's easy to implement after this.)

The cluster bus is not synchronized with the replication stream. If we use the cluster bus for SETSLOT, the admin would need to wait for all replicas to ack the new state before starting to move keys, if we want replica read consistency.

@soloestoy

soloestoy commented Jul 4, 2023

Copy link
Copy Markdown
Contributor

I still believe cluster bus is the right way.

But, maybe we can change our perspective. Why do we have to rely on replication or cluster bus to propagate the metadata? Isn't it simpler and more explicit to execute "cluster setslot with master node id" on the replica nodes directly? After all, "REPLICATE" also needs to be explicitly specified by the administrator. For the old version of the cluster manager, this is a breaking change. It's better to let the administrator be responsible for making sure that the replica nodes are aware of the migration status.

@madolson

madolson commented Jul 4, 2023

Copy link
Copy Markdown
Contributor

Why do we have to rely on replication or cluster bus to propagate the metadata? Isn't it simpler and more explicit to execute "cluster setslot with master node id" on the replica nodes directly? After all, "REPLICATE" also needs to be explicitly specified by the administrator.

Although your proposal is simpler, it is adding failure points into the system and also requires all existing systems to adopt the new API calls. I don't think we should be pushing too much onto administrators, especially with flotilla wanting to be more of a stateless controller.

I think the main decision point is this though, how do we want the migrating and importing state to be recognized on the replica. Although we haven't implemented it yet, I also want to discuss how it may relate to Cluster V2 (flotilla) as well as an atomic slot migration.

Option 1: Have the slot state be transferred through the replication state. Ideal state is a RDB full sync includes slot information + replicating set slot commands.

  • Will make sure the replica is serving replication data with snapshot consistency.
  • Clients can use the wait command to make sure that data is replicated. Clients are required to make a change to get the benefit of data being replicated though.
  • Atomic slot migration will need a way to indicate to clients that they have the full sync of a slot, to indicate they can successfully serve the data.
  • Flotilla will set importing and migrating state and rely on the server to coordinate transferring the data.

Option 2: Have the slot state be transferred through the cluster state

  • Replica may serve a slot is empty or has stale data for since the data has already been fully migrated to another slot. The most serious places I've seen this is right have a sync, the replica might believe it "owns" a slot, but have partial data for it.
  • Clients would need to implement logic to check to see that all replicas have acknowledged the new state.
  • Atomic slot migrations would need to wait for replicas to acknowledge the importing/migrating state through cluster messages.
  • Flotilla would replace the importing/migrating state messages sent by the client.

@PingXie

PingXie commented Jul 5, 2023

Copy link
Copy Markdown
Contributor Author

I still believe cluster bus is the right way.

I haven't had a chance to evaluate the cluster bus solution in depth, but one thing I'm pretty certain about is that we'll need to establish a new mechanism parallel to 'WAIT'. This will be essential to attain parity with the replication solution, which enables the caller to explicitly synchronize with replicas regarding the slot migration states. That's why I still think that this issue fundamentally revolves around replication.

It's better to let the administrator be responsible for making sure that the replica nodes are aware of the migration status.

Philosophically speaking, I beg to differ with this statement. If we were to adopt the approach of "admin handling everything," I believe there would be minimal need to address any reliability concerns since, ultimately, the admin would be responsible. IMO, the main objective here is to reduce human intervention, or the involvement of the control plane to a greater extent, allowing the Redis service (managed or not) to have an opportunity to achieve 3/4/5 9s of SLOs by itself.

@soloestoy

Copy link
Copy Markdown
Contributor

Philosophically speaking, I beg to differ with this statement. If we were to adopt the approach of "admin handling everything," I believe there would be minimal need to address any reliability concerns since, ultimately, the admin would be responsible. IMO, the main objective here is to reduce human intervention, or the involvement of the control plane to a greater extent, allowing the Redis service (managed or not) to have an opportunity to achieve 3/4/5 9s of SLOs by itself.

Seems you misunderstood me, my point is that the administrator is responsible for all metadata, such as passwords, common configuration items such as "appendonly", and cluster configuration items such as "cluster-allow-reads-when-down". It is the responsibility of the administrator to ensure that these are set correctly and consistently on both the master and replicas. In my opinion, slot information is also metadata or a configuration item, and it is the right choice for the administrator to ensure consistency between the master and replicas. Even when using the "WAIT" command, it is not transparent to the administrator, and the administrator is still responsible for making the final consistent judgment.

@zuiderkwast

Copy link
Copy Markdown
Contributor

Requiring manual intervention is quite bad IMO. It's too easy to mess up.

A cluster does guarantee some things by itself, like making sure there is exactly one master per shard. Guaranteeing slot ownership consistency is one such thing it should do IMO.

If some configuration (or ACL, etc.) can be inconsistent between different nodes, that's a design flaw IMO and we better fix that; propagate that config to make sure all nodes have consistent configuration. That's a different discussion though.

@ushachar

ushachar commented Jul 5, 2023

Copy link
Copy Markdown
Contributor

Flotilla orchestrates all slot changes through the Topology Director (the admin never needs to communicate directly with the data nodes), and guarantees that a newly elected primary node is at least as up to date on slot ownership than any of its replicas.
(Since a replica won't be promoted before it acks the latest topology known by the Failover Coordinator).

There's a potential for stale reads when read-from-replica is enabled -- but that's inherent in any read from replica usecase....

@soloestoy

Copy link
Copy Markdown
Contributor

Flotilla orchestrates all slot changes through the Topology Director (the admin never needs to communicate directly with the data nodes)

I strongly agree with this approach, and I also want to point out that in this case, the Topology Director is the real administrator, as it is doing the job of an administrator.

@soloestoy

Copy link
Copy Markdown
Contributor

There's a potential for stale reads when read-from-replica is enabled -- but that's inherent in any read from replica usecase....

As far as I know, in areas outside of Redis, there are many systems that can achieve consistent reads on replica databases, such as Polardb. However, this is another interesting topic.

@zuiderkwast

Copy link
Copy Markdown
Contributor

the Topology Director is the real administrator, as it is doing the job of an administrator.

That's very bad analogy IMO.

The Topology Director is a raft cluster, which achieves consensus among shards. The consistency between master and replicas is handled by Failover Coordinator, which is another raft cluster per shard. These are part of the cluster, not admins.

An administrator is a user, typically a human, orchestrating the whole thing. An admin should not be able to induce inconsistent slot ownerships by sending SETSLOT differently to different nodes.

@soloestoy

Copy link
Copy Markdown
Contributor

I don't think so, an administrator is not a specific person, it is a role, as you said, responsible for resource orchestration, metadata management, and other tasks.

@ushachar

ushachar commented Jul 5, 2023

Copy link
Copy Markdown
Contributor

@soloestoy Sure -- I meant in the Redis replica sense.... Not as a global statement for all dist systems :)
FWIW I'd argue that a Paxos learner (from what I see PolarDB uses that for read-only members) isn't really a 'replica'....

@PingXie

PingXie commented Nov 27, 2023

Copy link
Copy Markdown
Contributor Author

Hi @soloestoy and team, revisiting our conversation on this PR, I've been considering the cluster bus method. My preference for the replication approach primarily stems from how the WAIT command for replica acknowledgment naturally aligns with it. The proven functionality of this approach and the extensive time invested in its development are also key factors. Moving to the cluster bus method would mean developing a new mechanism for acknowledgment. While there are often multiple viable solutions to a problem, I think it is crucial to understand the concrete advantages of each. @soloestoy, your insights on its benefits would be invaluable.

@oranagra @zuiderkwast @madolson

@zuiderkwast

Copy link
Copy Markdown
Contributor

I'm still waiting for a merge. :) It's a good improvement.

A bonus point for making replicas aware of migrations: It opens for replicas to return -ASK redirects. Currently, they don't know about ongoing migration so they just return null for already migrated keys.

@PingXie

PingXie commented Dec 11, 2023

Copy link
Copy Markdown
Contributor Author

Hey team, can we bring this PR to the table at your next core team meeting? It feels like the ideal time to reach a decision and keep things moving. I'm ready to walk you through my thought process and am keen to hear your perspectives.

@PingXie

PingXie commented Mar 2, 2024

Copy link
Copy Markdown
Contributor Author

all tests have passed. This PR is ready for merge.

@PingXie PingXie closed this by deleting the head repository Mar 20, 2024
@sundb sundb removed this from Redis 8.2 Aug 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

state:needs-review the PR requires additional review

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

7 participants