
RFC: (first exploration of) upgrades/migrations#5985

Merged
tbg merged 1 commit into master from tschottdorf/upgrade_rfc on Apr 26, 2016
Conversation

@tbg
Member

@tbg tbg commented Apr 11, 2016

[ci skip]



@tbg
Member Author

tbg commented Apr 11, 2016

via @spencerkimball:

Wow, this is a hairy problem indeed. Here's a more radical option which occurs to me:
Execute commands in a batch engine on the leader before proposing to raft. Then send extremely simple KV ops (a slice of put & delete ops) as the raft command.

  • Pro: This eliminates the problems associated with versioning causing replica drift, as the raft commands themselves are applied by a single participant (and therefore a single version), and what happens when raft applies commands no longer involves any reading or conditional decision making.
  • Pro: This will also dramatically reduce CPU usage and read I/Os across the cluster. I expect we'll see performance improvements > 1.5x with 3x replication (though admittedly this is just conjecture based on CPU usage when running three servers on my laptop).
  • Pro: This opens the door to us storing the data from a raft command just once when we write it to the raft log instead of once to the raft log and then again when applying the raft command to the FSM. Details are hazy here still in my mind, however, so take this suggestion with a grain of salt.
  • Con: DeleteRange will potentially create very large raft commands.

@bdarnell
Contributor

Replying to @spencerkimball's idea first because it's shorter:

I really like this idea in principle; wherever we can move processing from post-raft to pre-raft it's a big win for safety and future-proofing. However, there's a big problem with pipelining: when does the leader apply the changes to its local state? It can't apply them immediately because there is no guarantee that they will be committed by raft. But if it doesn't apply them immediately then we have to be very careful about dependencies between commands.

In the command queue, reads don't block reads and writes don't block writes, but reads and writes block each other. We currently classify read/write operations like CPut as writes and propose them directly to raft; the raft commit process ensures that everything is processed in the proper order and each CPut sees exactly the writes that were committed before it. If we move the CPut conditional processing to before raft, then we'll need to treat that command as both a read and a write for purposes of the command queue, so it will not be parallelizable with anything. I expect that distributed SQL will introduce more of these kinds of CPut-like hybrid commands (or DeleteRange-like requests that have a high ratio of KV changes to command size).

Even "write-only" commands like Put could be affected, since they need to read to check for pending intents, so I'm not sure we'd be able to propose anything while a previous command is outstanding on the same key. This would severely limit our single-key throughput.

@spencerkimball
Member

@bdarnell writes actually do block on preceding overlapping writes in the command queue. Looking back at that change (bdefc85), I believe the original impetus is no longer as strong as it was (though relaxing the write/write blocking in the command queue is still tricky). However, perhaps it makes sense to leave this in place and reap both potentially large performance gains and sanity in the upgrade / downgrade migration story.

@bdarnell
Contributor

@spencerkimball I stand corrected. A recent discussion about write performance led me to think we had changed this.

I think it's probably worthwhile to figure out how to eliminate write/write blocking. That's a big performance limitation and affects our "speed of light" numbers by delaying the start of raft operations. I think it may be responsible for the poor performance noted in #5981, for example. The performance gains (in reduced CPU usage and read bandwidth) of this proposal seem like a poor tradeoff to me if it prevents us from eliminating write/write blocking. It may be worthwhile anyway by making upgrades more tractable, but I'm not convinced it's a long-term performance win.


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@spencerkimball
Member

@bdarnell not just upgrades, btw. Downgrades too, which seem intractable otherwise (short of just dumping your data and reloading it in bulk of course).

Yes, a busy succession of writes to the same key is going to be expensive. That said, I'm not convinced it's a net performance loss.

I think there's also a problem with raft replays: there's no way to detect them with this proposal, since it's just a blind application of a write batch.

@bdarnell
Contributor

Now on to @tschottdorf's proposal. This generally looks good; I think the ChangeVersion RPC is probably the only option for online updates to raft commands. We'll need something like that whether we go with @spencerkimball's proposal or not (the only difference is that with that proposal we'd be using ChangeVersion a lot less often). I'm not sure whether we need to embrace the wholesale duplication at this point or if we can try to scrape by with a lot of ifs in the code until things reach a breaking point.


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions.


docs/RFCS/version_upgrades.md, line 15 [r1] (raw file):
s/migrations/raft command migrations/

There are many levels at which migration frameworks will be needed; this is just one of them (although it's probably the trickiest one)


docs/RFCS/version_upgrades.md, line 23 [r1] (raw file):
s/Raft/the application of raft commands/

The changes that this affects are not part of raft per se, but in the things that make up the raft "state machine" (primarily replica_command.go and mvcc.go and their dependencies).


docs/RFCS/version_upgrades.md, line 86 [r1] (raw file):
This map should contain more than commands: functions like MVCCPut belong there as well, and commands like Put should go through the current ReplicaMethods instead of calling it directly. This will make it possible to update those methods as well.


docs/RFCS/version_upgrades.md, line 93 [r1] (raw file):
And propagated across splits, and added to the ever-growing list of problems we'll have to address before we can implement merges.

I was going to suggest that the version be stored in the range descriptor, but it's probably better to keep it in a separate key so that ChangeVersions itself can be as simple as possible (and thus less likely to need versioning itself).

But can't ChangeVersions be versioned in the usual way? I don't think there are any circularity concerns here except for the first time ChangeVersions is introduced (and that time I think we'll just have to rely on stop-the-world with no downgrades possible).


docs/RFCS/version_upgrades.md, line 112 [r1] (raw file):
Nothing in this replica can be processed any more, but other replicas on the store are still valid.


docs/RFCS/version_upgrades.md, line 122 [r1] (raw file):
Downgrades imply an additional requirement: new versions must continue to produce output that can be interpreted for older versions. That's trivial for purely additive changes, but if we e.g. delete a field we may have to keep populating it for older versions. We'll need policies for how far back downgrades are allowed (probably more restrictive than our upgrade policy: you can go from 1.0 to 1.2, but maybe once you've gone to 1.2 you can only downgrade to 1.1).

One advantage to making the upgrade use an explicit command like cockroach update latest is that it lets us stage multiple command versions in one cockroach version. For example, suppose a new release includes command versions 125 and 126. 125 includes backward-compatibility code to enable downgrades to version 124, and version 126 just strips that logic out (removing the ability to downgrade past 125). This would accelerate the rate at which we're able to remove code that's just there to support downgrades (and would otherwise need to stick around for a full upgrade cycle).


docs/RFCS/version_upgrades.md, line 136 [r1] (raw file):
We'll also need to make sure that a versioned method never calls an unversioned method (except for things like logging that do not affect the output).


docs/RFCS/version_upgrades.md, line 140 [r1] (raw file):
We should also have a regression test suite with known input/output data to detect cases in which an old version of a command inadvertently changes.


docs/RFCS/version_upgrades.md, line 160 [r1] (raw file):
I think it would take more than "a couple of times" for small changes like this to make the code unreadable. We'd get there eventually to be sure, but jumping straight to another copy of the method feels heavy-handed. Maybe it's worth it for the mindless simplicity, though.


docs/RFCS/version_upgrades.md, line 172 [r1] (raw file):
I think the only tractable approach to lint for this is to use a whitelist of things that versioned methods are allowed to touch, instead of a blacklist of things they aren't.


docs/RFCS/version_upgrades.md, line 182 [r1] (raw file):
The easiest way to do this is to duplicate the protos too. replica_command.125.LeaderLease.go would not refer to roachpb.Lease but to roachpb.Lease_125


docs/RFCS/version_upgrades.md, line 190 [r1] (raw file):
Adding a new nullable (or repeated) field is fine as long as it's not used until a VersionChange; it's only non-nullable fields that have problems with old versions using new marshaling code.


docs/RFCS/version_upgrades.md, line 198 [r1] (raw file):
But these are outside of raft commands, so the more extreme compatibility constraints don't apply. I think that normal protobuf versioning strategies can suffice for the outside-of-raft accesses to this field.

We may need to move away from storing protobuf objects in Replica attributes, and instead copy to/from the protobufs into custom in-memory structs at marshal/unmarshal time. However, changes to the in-memory struct could require historical command versions to be changed, which is scary. It's best if raft commands refer to the on-disk data unmarshaled by the old protos as much as possible instead of the in-memory structs.


docs/RFCS/version_upgrades.md, line 203 [r1] (raw file):
This one seems simple to me: we just insert a dummy value in the command version map that makes the new command a noop for all versions before its introduction. Am I missing something?


docs/RFCS/version_upgrades.md, line 230 [r1] (raw file):
There's nothing magical about there being two versions: if we can automate two versions we could automate N, and it just costs us some binary size. If a wholesale copy of files/packages would be easier to work with then it's worth considering even for multiple versions. I'm not sure how much more we'd be able to automate in this case, though.


docs/RFCS/version_upgrades.md, line 255 [r1] (raw file):
"Won't acknowledge" implies poking around in raft implementation details, and I'm not sure it's even possible with the raft.Ready interface to acknowledge only part of the log. It would be safer to make this a pair of Freeze/Unfreeze commands: all commands committed between a freeze and its unfreeze become noops.

This seems like a lot of machinery for something that's doomed to be replaced in the relatively near future. For a dramatic enough change it might be worth building this out (e.g. for Spencer's proposal to move most processing from post-raft to pre-raft), but for raft command migrations I would strongly prefer bypassing the stop-the-world phase and going straight to online migrations (for migrations at other levels stop-the-world could make more sense).


docs/RFCS/version_upgrades.md, line 283 [r1] (raw file):
Rather than disable the consistency checker, we could make the consistency checker smarter and more semantically aware. For example, if it knew which values contained protobufs it could decode the raw proto data and throw out zero values. This wouldn't help with larger migrations, but it could avoid turning what should be a trivial change like adding a proto field into a huge ordeal.


docs/RFCS/version_upgrades.md, line 287 [r1] (raw file):
Two more notable ones:

  • For online upgrades, the replica allocator needs to be aware of this: nodes should gossip the range of raft-command versions they support, and the allocator should not put a range on a node that doesn't support its current version. There are possible races here (what if we're mid-upgrade and the range gets updated just as the allocator is assigning it to an older node?). I'm not sure how much we care about this as long as nodes fail reasonably gracefully when given a replica whose version they don't support, but one solution would be to threshold the ChangeVersion process so it doesn't start until a certain percentage of nodes have been upgraded; once that threshold has been reached, the allocator would not assign any new replicas to nodes that are behind the majority.
  • There are some tricky situations that can arise in developer environments. Suppose you and I are each separately working on a change that requires a raft command migration. If we both pick the next sequential version number for our change, then the one who loses the race to merge will not only have to change the version number used in their branch, they'll also have to wipe any persistent clusters which had applied the previous version number with the now-incorrect meaning. If we coordinated and chose different version numbers, then the one who got the higher value would still have to wipe any persistent clusters when the lower-valued one committed. At Viewfinder our client-side migrations were named rather than numbered for this reason (although server-side migrations were still numbered; we had a lot fewer of them). I'm not sure whether it's worth the complexity to handle this here or if we just assume/mandate that developers will not have persistent clusters running unmerged code.


@bdarnell
Contributor

@spencerkimball Once we have #5973 to make the lease holder and raft leader the same, we have new options for dealing with raft replays. Our current replay situation exists in part because the follower may need to replay commands to the leader; when that is no longer common we can include term number and log position in the proposed command and make it a noop if it manages to commit at an unexpected place in the log.


Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions.



@tbg
Member Author

tbg commented Apr 14, 2016

Circling back to @spencerkimball's suggestion to hoist Raft command execution to the layer above Raft. It's going to solve a whole lot of problems if we can pull it off. Here's some more half-baked rambling: To avoid read-write issues, the approach that seems "correct" to me is the following:

  • on top of RocksDB, for each in-flight Raft command we keep an in-memory RocksDB (this isn't actually going to work out with an actual in-memory RocksDB, but let's not overcomplicate this right away), which is where a Raft command's output is stored until it gets applied by the state machine (in which case the writes are flushed down to "actual" RocksDB).
  • Whenever a command is about to be proposed, we compute its writes from the state which would result from applying all previously proposed commands.

Example:

  • key=A and no outstanding commands (i.e. no in-memory state)
  • process CPut(key, A->B): reads A, adds in-memory state key=B, proposes Put(key, B)
  • process CPut(key, B->C): reads B from in-memory state, adds in-memory state key=C, proposes Put(key, C)
  • both eventually commit (but must commit in the right order).

Obvious con: Raft does not guarantee that things commit in the order in which we propose them. We must work around that somehow. I'm sure there's a whole lot more complication in the weeds, but this does seem like the naive starting point, so I wanted to write it down.

The upshot is that no command waits for any other as long as Raft does not reorder log entries (which it hopefully shouldn't under normal circumstances). Error handling has some caveats as well: a ConditionFailedError should not be returned based on the in-memory state, because the offending write may never commit, but once you have a ConditionFailedError you can rest assured that the value present is not the one you were checking for.

@spencerkimball
Member

@tschottdorf the command queue, as currently constituted, obviates the need to successively build on the uncommitted (in Raft) writes of previous commands. We don't necessarily need to solve the raft-commit-wait on write/write conflicts as part of hoisting the execution of commands above raft. I realize you're just recording ideas, but want to make this point for the record.

@tbg
Member Author

tbg commented Apr 14, 2016

That's a good point, I was brainstorming on the "speed of light" comment
@bdarnell made earlier.

-- Tobias

@petermattis
Collaborator

Requiring precise byte compatibility of the implementations of the KV operations "below" raft across software versions is overly burdensome. We would need a significant refactoring and large amounts of testing infrastructure to make that approach tractable and even then I worry that it would be fragile and severely restrict our ability to change our KV operations. Recent history shows that there is both ongoing development of the KV operation implementations as well as interest in adding new functionality (Spencer's one-phase-commit optimization, the additional boolean to CPut, the leader-lease max-offset field, etc). Also of note is that the storage/engine.Engine interface to RocksDB has been much more stable. I think it would be worthwhile to fully flesh out Spencer's proposal.

@bdarnell
Contributor

Yeah, it's become clear that it's untenable to maintain compatibility with the large volume of code downstream of raft, so I think we have no choice but to do something like this to reduce the scope of our most stringent compatibility requirements. I'm concerned that we still have write/write blocking, but at least this change wouldn't make things any worse than they already are, and we should be able to find ways to address write/write blocking even with all the logic moved pre-raft.

We'll still need some small parts of this RFC, mainly the ChangeVersion (or Freeze/Unfreeze) RPCs: we'll need them at least to switch from the old style to the new one, and for any future changes to the new write-batch interface. But hopefully those changes will be rare enough that we won't have a burdensome amount of code duplication.

@petermattis
Collaborator

Looks like we're approaching a consensus here. Who would be best to fold Spencer's proposal into this RFC? @tschottdorf?

@tbg
Member Author

tbg commented Apr 16, 2016

Pretty sure we'll need some input from all of the participants here to hash
this out properly, but I can provide a starting point next week on how the
mechanisms we have in place would be affected by these changes.


-- Tobias

@tbg
Member Author

tbg commented Apr 18, 2016

I'll have something posted tomorrow.

-- Tobias

@tbg
Member Author

tbg commented Apr 26, 2016

What do we do with this? Merge as rejected, or close?

@bdarnell
Contributor

Merge as rejected. It's good to capture this for the record in a more discoverable way than an old closed PR.

@petermattis
Collaborator

Ditto what @bdarnell said.

@tbg tbg force-pushed the tschottdorf/upgrade_rfc branch from a12cfbe to c3e9300 on April 26, 2016 14:53
@tbg
Member Author

tbg commented Apr 26, 2016

Ok, SGTM.

@tbg tbg force-pushed the tschottdorf/upgrade_rfc branch from c3e9300 to 7e09e12 on April 26, 2016 14:59
@tbg tbg force-pushed the tschottdorf/upgrade_rfc branch from 7e09e12 to dfbd0fb on April 26, 2016 14:59
@tbg tbg merged commit 23217be into master Apr 26, 2016
@tbg tbg deleted the tschottdorf/upgrade_rfc branch April 26, 2016 15:15