RFC: (first exploration of) upgrades/migrations#5985
Conversation
|
via @spencerkimball: Wow, this is a hairy problem indeed. Here's a more radical option which occurs to me:
|
|
Replying to @spencerkimball's idea first because it's shorter: I really like this idea in principle; wherever we can move processing from post-raft to pre-raft it's a big win for safety and future-proofing. However, there's a big problem with pipelining: when does the leader apply the changes to its local state? It can't apply them immediately because there is no guarantee that they will be committed by raft. But if it doesn't apply them immediately then we have to be very careful about dependencies between commands. In the command queue, reads don't block reads and writes don't block writes, but reads and writes block each other. We currently classify read/write operations like CPut as writes and propose them directly to raft; the raft commit process ensures that everything is processed in the proper order and each CPut sees exactly the writes that were committed before it. If we move the CPut conditional processing to before raft, then we'll need to treat that command as both a read and a write for purposes of the command queue, so it will not be parallelizable with anything. I expect that distributed SQL will introduce more of these kinds of CPut-like hybrid commands (or DeleteRange-like requests that have a high ratio of KV changes to command size). Even "write-only" commands like Put could be effected since they need to read for any pending intents, so I'm not sure if we'll be able to propose anything while a previous command is outstanding on the same key. This will severely limit our single-key throughput. |
|
@bdarnell writes actually do block on preceding overlapping writes in the command queue. Looking back at that change (bdefc85), I believe the original impetus is no longer as strong as it was (though relaxing the write/write blocking in the command queue is still tricky). However, perhaps it makes sense to leave this in place and reap both potentially large performance gains and sanity in the upgrade / downgrade migration story. |
|
@spencerkimball I stand corrected. A recent discussion about write performance led me to think we had changed this. I think it's probably worthwhile to figure out how to eliminate write/write blocking. That's a big performance limitation and affects our "speed of light" numbers by delaying the start of raft operations. I think it may be responsible for the poor performance noted in #5981, for example. The performance gains (in reduced CPU usage and read bandwidth) of this proposal seem like a poor tradeoff to me if it prevents us from eliminating write/write blocking. It may be worthwhile anyway by making upgrades more tractable, but I'm not convinced it's a long-term performance win. Review status: 0 of 1 files reviewed at latest revision, all discussions resolved. Comments from Reviewable |
|
@bdarnell not just upgrades, btw. Downgrades too, which seem intractable otherwise (short of just dumping your data and reloading it in bulk of course). Yes a busy succession of writes to the same key are going to be expensive. That said, I'm not convinced it's a net performance loss. I think there's also a problem with raft replays. There's no way to know we have them with this proposal since it's just a blind application of a write batch. |
|
Now on to @tschottdorf 's proposal. This generally looks good; I think the Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions. docs/RFCS/version_upgrades.md, line 15 [r1] (raw file): There are many levels at which migration frameworks will be needed; this is just one of them (although it's probably the trickiest one) docs/RFCS/version_upgrades.md, line 23 [r1] (raw file): The changes that this affects are not part of raft per se, but in the things that make up the raft "state machine" (primarily docs/RFCS/version_upgrades.md, line 86 [r1] (raw file): docs/RFCS/version_upgrades.md, line 93 [r1] (raw file): I was going to suggest that the version be stored in the range descriptor, but it's probably better to keep it in a separate key so that Bu can't docs/RFCS/version_upgrades.md, line 112 [r1] (raw file): docs/RFCS/version_upgrades.md, line 122 [r1] (raw file): One advantage to making the upgrade use an explicit command like docs/RFCS/version_upgrades.md, line 136 [r1] (raw file): docs/RFCS/version_upgrades.md, line 140 [r1] (raw file): docs/RFCS/version_upgrades.md, line 160 [r1] (raw file): docs/RFCS/version_upgrades.md, line 172 [r1] (raw file): docs/RFCS/version_upgrades.md, line 182 [r1] (raw file): docs/RFCS/version_upgrades.md, line 190 [r1] (raw file): docs/RFCS/version_upgrades.md, line 198 [r1] (raw file): We may need to move away from storing protobuf objects in docs/RFCS/version_upgrades.md, line 203 [r1] (raw file): docs/RFCS/version_upgrades.md, line 230 [r1] (raw file): docs/RFCS/version_upgrades.md, line 255 [r1] (raw file): This seems like a lot of machinery for something that's doomed to be replaced in the relatively near future. For a dramatic enough change it might be worth building this out (e.g. for Spencer's proposal to move most processing from post-raft to pre-raft), but for raft command migrations I would strongly prefer bypassing the stop-the-world phase and going straight to online migrations (for migrations at other levels stop-the-world could make more sense). docs/RFCS/version_upgrades.md, line 283 [r1] (raw file): docs/RFCS/version_upgrades.md, line 287 [r1] (raw file):
Comments from Reviewable |
|
@spencerkimball Once we have #5973 to make the lease holder and raft leader the same, we have new options for dealing with raft replays. Our current replay situation exists in part because the follower may need to replay commands to the leader; when that is no longer common we can include term number and log position in the proposed command and make it a noop if it manages to commit at an unexpected place in the log. Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions. Comments from Reviewable |
|
Circling back to @spencerkimball's suggestion to hoist Raft command execution to the layer above Raft. It's going to solve a whole lot of problems if we can pull it of. Here's some more half-baked rambling: To avoid read-write issues, the approach that seems "correct" to me is the following:
Example:
Obvious con: Raft does not guarantee things commit in the order in which we propose it. We must work around that somehow. I'm sure there's a whole lot more complication in the weeds. But that does seem like the naive starting point, so wanted to write it down. The upshot is that nobody waits for nobody if Raft does not reorder log entries (which it hopefully shouldn't under normal circumstances). Error handling has some caveats as well (a |
|
@tschottdorf the command queue, as currently constituted, obviates the need to successively build on the uncommitted (in Raft) writes of previous commands. We don't necessarily need to solve the raft-commit-wait on write/write conflicts as part of hoisting the execution of commands above raft. I realize you're just recording ideas, but want to make this point for the record. |
|
That's a good point, I was brainstorming on the "speed of light" comment -- Tobias |
|
Requiring precise byte compatibility of the implementations of the KV operations "below" raft across software versions is overly burdensome. We would need a significant refactoring and large amounts of testing infrastructure to make that approach tractable and even then I worry that it would be fragile and severely restrict our ability to change our KV operations. Recent history shows that there is both ongoing development of the KV operation implementations as well as interest in adding new functionality (Spencer's one-phase-commit optimization, the additional boolean to CPut, the leader-lease max-offset field, etc). Also of note is that the |
|
Yeah, it's become clear that it's untenable to maintain compatibility with the large volume of code downstream of raft, so I think we have no choice but to do something like this to reduce the scope of our most stringent compatibility requirements. I'm concerned about the fact that we still have write/write blocking but at least this change wouldn't make things any worse than they already are and we should be able to find ways to address write/write blocking even with all the logic moved pre-raft. We'll still need some small parts of this RFC, mainly the ChangeVersion (or Freeze/Unfreeze) RPCs: we'll need them at least to switch from the old style to the new one, and for any future changes to the new write-batch interface. But hopefully those changes will be rare enough that we won't have a burdensome amount of code duplication. |
|
Looks like we're approaching a consensus here. Who would be best to fold Spencer's proposal into this RFC? @tschottdorf? |
|
Pretty sure we'll need some input from all of the participants here to hash On Sat, Apr 16, 2016 at 10:45 AM Peter Mattis notifications@github.com
-- Tobias |
|
I'll have something posted tomorrow. On Sat, Apr 16, 2016, 20:30 Tobias Schottdorf tobias@cockroachlabs.com
-- Tobias |
|
What do we do with this? Merge as rejected, or close? |
|
Merge as rejected. It's good to capture this for the record in a more discoverable way than an old closed PR. |
|
Ditto what @bdarnell said. |
a12cfbe to
c3e9300
Compare
|
Ok, SGTM. |
c3e9300 to
7e09e12
Compare
7e09e12 to
dfbd0fb
Compare
[ci skip]
This change is