
[RFC] Proposer-evaluated KV #6166

Merged
tbg merged 5 commits into master from tschottdorf/leader-evaluated-raft on Apr 25, 2016
Conversation

@tbg
Member

@tbg tbg commented Apr 19, 2016

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions.


docs/RFCS/leader_evaluated_raft.md, line 95 [r1] (raw file):
This still seems like a lot of logic to have post-raft. Would the recent lease changes have been safe in this model, or would we have needed an approach more like the one in #5985 for that one?

My initial reaction on seeing this list was that this is exactly the stuff that we may need to change and we haven't gained much if it's just a way to make changes to the relatively simple commands that aren't on this list. But after looking at the list of migrations considered in #5985 it does seem to cover most of them. And even for these commands the bulk of the work could still be done with the mvcc batch. For instance, only the call to Store.SplitRange needs to be covered here; the rest of the splitTrigger can be run pre-raft. And the portions of these commands that remain post-raft may mainly operate on things that are outside the raft "state machine" so compatibility requirements may not be as strict.


docs/RFCS/leader_evaluated_raft.md, line 96 [r1] (raw file):
What would be done here for the abort cache? Isn't it just KV pairs with no in-memory state or side effects?


docs/RFCS/leader_evaluated_raft.md, line 178 [r1] (raw file):
We had a pure go batch implementation prior to 7d4e2fc. Would it be better to revive that than to wrap a batch with something that saves a second copy of all writes?


docs/RFCS/leader_evaluated_raft.md, line 208 [r1] (raw file):
Some more detail: The RaftCmd would also include a term and log index, to be populated with the leader's latest log information when it is proposed. The writes would only be applied if it committed at that index; they would be dropped (and reproposed) if they committed at any other index. This would only be a possibility when the lease holder is not the raft leader (and even then it's unlikely in stable lease situations). We may be able to be even stricter about colocating the two roles (in which case we could perhaps remove raft-level replays completely), but I'm wary of forcing too many raft elections.
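The replay-protection check described above can be sketched as follows; the struct and field names here are illustrative, not the actual `RaftCmd` fields:

```go
package main

import "fmt"

// proposal carries the log position the proposer predicted when it
// submitted the command. (Hypothetical names for illustration only.)
type proposal struct {
	proposedTerm  uint64
	proposedIndex uint64
}

// shouldApply returns true only if the command committed at exactly the
// term and index the proposer predicted; otherwise the writes are dropped
// and the command must be reproposed.
func shouldApply(p proposal, committedTerm, committedIndex uint64) bool {
	return p.proposedTerm == committedTerm && p.proposedIndex == committedIndex
}

func main() {
	p := proposal{proposedTerm: 5, proposedIndex: 100}
	fmt.Println(shouldApply(p, 5, 100)) // committed where predicted: apply
	fmt.Println(shouldApply(p, 5, 101)) // reordered: drop and repropose
}
```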


docs/RFCS/leader_evaluated_raft.md, line 232 [r1] (raw file):
Duplicate heading


docs/RFCS/leader_evaluated_raft.md, line 238 [r1] (raw file):
Downreplicating to a single copy scares me. Partially for the obvious concern that an ill-timed disk failure could lead to data loss, but mainly because that operation is not well tested and is likely to encounter edge cases around quorums. A single-replica configuration in a multi-node cluster may make the rebalancer do interesting things (e.g. I'm not sure if you can stop it from creating a temporary second replica as it moves things around).


docs/RFCS/leader_evaluated_raft.md, line 251 [r1] (raw file):
We'll eventually need something like the scheme in #5985 for migrations of what remains post-raft. So anything done in that direction, while complex, would not be wasted work. In particular, the Freeze/Unfreeze pair of RPCs I proposed would make it possible to make this switch without downreplicating to a single copy.


docs/RFCS/leader_evaluated_raft.md, line 254 [r1] (raw file):
More drawbacks:

  • While the amount of code downstream of raft is drastically reduced, it is not zero, so we will still need something like RFC: (first exploration of) upgrades/migrations #5985 eventually. But hopefully this will not be something we need to use often.
  • "Dumb" raft commands generally need to be strictly serialized by the command queue, while "smart" ones can be pipelined more effectively. We currently enforce serialization anyway, but this change removes the option of pipelining.

docs/RFCS/leader_evaluated_raft.md, line 256 [r1] (raw file):
DeleteRange is an extreme case, but I would guess that most commands will have some inflation here (would be good to measure with a quick prototype). We're saving the CPU work and disk reads by not repeating those operations on each replica, but we'll probably be increasing network traffic and disk writes.

Splits will also be large if the abort cache is copied into KV Write pairs.


docs/RFCS/leader_evaluated_raft.md, line 260 [r1] (raw file):
s/and no//


Comments from Reviewable

@tbg tbg force-pushed the tschottdorf/leader-evaluated-raft branch from fca7031 to 41ce979 on April 20, 2016 07:51
@tbg
Member Author

tbg commented Apr 20, 2016

Added a second commit to address @bdarnell's initial round of comments.


Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


docs/RFCS/leader_evaluated_raft.md, line 95 [r1] (raw file):

And the portions of these commands that remain post-raft may mainly operate on things that are outside the raft "state machine" so compatibility requirements may not be as strict.

Yes, exactly. LeaderLease is in this list only because we have in-memory state on `*Replica` which we need to change (and there's this bit about gossiping some stuff if the new leader changes and updating the timestamp cache, etc). It's not the intention that all of `(*Replica).LeaderLease` move, though we will want the `Lease` proto in this trigger, so its migration would be a little harder potentially.

I added more detail on the individual triggers in the mock proto below.


docs/RFCS/leader_evaluated_raft.md, line 96 [r1] (raw file):
You're right, nothing to do there. Removed.


docs/RFCS/leader_evaluated_raft.md, line 178 [r1] (raw file):
Added a discussion of this to the Unresolved questions section. The best implementation would be RocksDB exposing a read-only (zero-copy) view of queued mutations; everything else we'd have to benchmark.


docs/RFCS/leader_evaluated_raft.md, line 208 [r1] (raw file):
Inserted.


docs/RFCS/leader_evaluated_raft.md, line 232 [r1] (raw file):
did s/Implement/Migr/


docs/RFCS/leader_evaluated_raft.md, line 251 [r1] (raw file):
Ah, {F,Unf}reeze does make a lot of sense here; it had dropped off my radar. Updated this section.


docs/RFCS/leader_evaluated_raft.md, line 254 [r1] (raw file):
Added. Can you elaborate on not being able to pipeline (ideally with a simple example)? I wasn't quite clear on the discussion in #5985 - it seemed to me that it would be as difficult then as it is now, so I'm likely missing the big picture.
The idea would be that we would want to shove a bunch of overlapping writes into Raft simultaneously, correct? Since we can avoid Raft reordering now (or at least with changes proposed in this RFC), the problem boils down to having all intermediate states which would result from applying the prefixes of the existing overlapping writes (including side effects) available to execute on top of. This RFC appears to operate completely below that layer.


docs/RFCS/leader_evaluated_raft.md, line 256 [r1] (raw file):
Added. The abortCache is (or at least should be) pretty small these days.


docs/RFCS/leader_evaluated_raft.md, line 260 [r1] (raw file):
Done. (-ish)


Comments from Reviewable

@tbg tbg force-pushed the tschottdorf/leader-evaluated-raft branch from 41ce979 to ff7bb96 on April 20, 2016 11:58
@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


docs/RFCS/leader_evaluated_raft.md, line 178 [r1] (raw file):
There is a significant stability and performance risk in moving away from the RocksDB WriteBatch as we would have to rewrite the "merged" iterator used for batches and the existing WriteBatch code has been tuned while a new implementation would not.

Underneath the hood, a RocksDB WriteBatch stores all of the writes in a C++ string (or the equivalent). This string is persisted to the RocksDB WAL when the WriteBatch is written. And note that all mutating operations in RocksDB are internally converted into WriteBatches. This means that the format of the internal WriteBatch buffer is stable.

I see two reasonable options here. One is that we simply pass around the internal WriteBatch buffer. RocksDB already has an interface for retrieving this internal data and adding an interface to create a WriteBatch from such a buffer would be straightforward.

The second option would be to iterate over the WriteBatch mutations and build up our own set of mutations to pass over the wire. This would be slightly less efficient, but would divorce us from RocksDB.

My preference is for the first option (using the internal WriteBatch::Data()). We can note the format of this data in the RaftCmd to allow for future versioning (e.g. if we ever move away from RocksDB). The format of this internal data is reasonably simple to parse if we ever need to do so.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
There is a significant stability and performance risk in moving away from the RocksDB WriteBatch as we would have to rewrite the "merged" iterator used for batches and the existing WriteBatch code has been tuned while a new implementation would not.

Underneath the hood, a RocksDB WriteBatch stores all of the writes in a C++ string. This string is persisted to the RocksDB WAL when the WriteBatch is written. Note that all mutating operations in RocksDB are internally converted into WriteBatches (i.e. this code path is heavily optimized and the format is stable).

I see two reasonable options here. One is that we simply pass around the internal WriteBatch buffer. RocksDB already has an interface for retrieving this internal data (WriteBatch::Data()) and adding an interface to create a WriteBatch from such a buffer would be straightforward.

The second option would be to iterate over the WriteBatch mutations and build up our own set of mutations to pass over the wire. This would be slightly less efficient, but would divorce us from RocksDB.

My preference is for the first option (using the internal WriteBatch::Data()). We can note the format of this data in the RaftCmd to allow for future versioning (e.g. if we ever move away from RocksDB). The format of this internal data is reasonably simple to parse if we ever need to do so.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
Ah, I see WriteBatch::WriteBatch(const std::string& rep) already exists. There are no changes to RocksDB necessary to pass around the internal WriteBatch buffer and reconstitute it into a WriteBatch on the receiving side.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


docs/RFCS/leader_evaluated_raft.md, line 178 [r1] (raw file):
This comment was accidentally duplicated here. Ignore and follow up on the version below.


Comments from Reviewable

@tamird
Contributor

tamird commented Apr 20, 2016

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


docs/RFCS/leader_evaluated_raft.md, line 280 [r4] (raw file):
s/a a/a/


Comments from Reviewable

@bdarnell
Contributor

Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


docs/RFCS/leader_evaluated_raft.md, line 254 [r1] (raw file):
Ah yes, the pipelining problems have to do with raft-level reordering so as long as we can solve that one then we can pipeline just as easily (or not) as we could before.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
If we can get the WriteBatch data in and out of rocksdb then that sounds ideal. My biggest concern would be about the compatibility guarantees that rocksdb makes (or doesn't) about this format. If we use this then I think we should write and test a parser even if we don't use it in production, so we will notice if rocksdb ever changes their format (if they change then we'll need to convert their new format to the old one for compatibility).


Comments from Reviewable

@tbg tbg force-pushed the tschottdorf/leader-evaluated-raft branch from ff7bb96 to 460f9c2 on April 21, 2016 12:53
@tbg
Member Author

tbg commented Apr 21, 2016

Updated, PTAL. I feel that this PR has split up relatively well. We have (depending on how you count) 7 well-contained prerequisites which could in theory all be worked on in parallel (the one needing most thought likely being the Freeze/Unfreeze part, which is queued first) and which would (mostly) be useful independently. The remaining "forklift" portion which actually moves the bulk processing pre-Raft is still going to take some effort, but it does appear to me that the bulk of the work is in those prerequisites.


Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions.


docs/RFCS/leader_evaluated_raft.md, line 254 [r1] (raw file):
Updated the discussion.


docs/RFCS/leader_evaluated_raft.md, line 280 [r4] (raw file):
Done.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
Here's the format (from c-rocksdb/internal/db/write_batch.cc; note that this is copied from a comment, so it could already have rotted):

// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring
//    kTypeDeletion varstring
//    kTypeSingleDeletion varstring
//    kTypeMerge varstring varstring
//    kTypeColumnFamilyValue varint32 varstring varstring
//    kTypeColumnFamilyDeletion varint32 varstring varstring
//    kTypeColumnFamilySingleDeletion varint32 varstring varstring
//    kTypeColumnFamilyMerge varint32 varstring varstring
// varstring :=
//    len: varint32
//    data: uint8[len]

For reference, LevelDB has the following:

// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring         |
//    kTypeDeletion varstring
// varstring :=
//    len: varint32
//    data: uint8[len]

Doesn't seem unlikely that RocksDB will diverge more as they add features, so I agree with writing the parser (also to not make it even harder to switch to another storage engine in the future).
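A minimal Go sketch of such a parser, decoding only the header and the two record types shared by both formats quoted above (the numeric tag values `kTypeDeletion = 0x0` and `kTypeValue = 0x1` are taken from the LevelDB source and are an assumption here; they could also rot):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	kTypeDeletion byte = 0x0 // assumed tag values from the LevelDB source
	kTypeValue    byte = 0x1
)

// readVarstring reads a varint32 length followed by that many bytes,
// returning the string and the remaining buffer.
func readVarstring(buf []byte) ([]byte, []byte, error) {
	n, m := binary.Uvarint(buf)
	if m <= 0 || uint64(len(buf)-m) < n {
		return nil, nil, fmt.Errorf("corrupt varstring")
	}
	return buf[m : m+int(n)], buf[m+int(n):], nil
}

// parseBatch decodes the WriteBatch repr per the quoted format:
// an 8-byte little-endian sequence, a 4-byte little-endian count,
// then `count` records, each a type tag followed by varstrings.
func parseBatch(repr []byte) (seq uint64, kvs map[string]string, err error) {
	if len(repr) < 12 {
		return 0, nil, fmt.Errorf("repr too short: %d bytes", len(repr))
	}
	seq = binary.LittleEndian.Uint64(repr[:8])
	count := binary.LittleEndian.Uint32(repr[8:12])
	buf := repr[12:]
	kvs = make(map[string]string)
	for i := uint32(0); i < count; i++ {
		if len(buf) == 0 {
			return 0, nil, fmt.Errorf("truncated batch")
		}
		tag := buf[0]
		buf = buf[1:]
		var k, v []byte
		switch tag {
		case kTypeValue:
			if k, buf, err = readVarstring(buf); err != nil {
				return 0, nil, err
			}
			if v, buf, err = readVarstring(buf); err != nil {
				return 0, nil, err
			}
			kvs[string(k)] = string(v)
		case kTypeDeletion:
			if k, buf, err = readVarstring(buf); err != nil {
				return 0, nil, err
			}
			delete(kvs, string(k))
		default:
			return 0, nil, fmt.Errorf("unknown record type %d", tag)
		}
	}
	return seq, kvs, nil
}

func main() {
	// Hand-assemble a repr containing a single Put("a", "b") at sequence 7.
	repr := make([]byte, 12)
	binary.LittleEndian.PutUint64(repr[:8], 7)
	binary.LittleEndian.PutUint32(repr[8:12], 1)
	repr = append(repr, kTypeValue, 1, 'a', 1, 'b')

	seq, kvs, err := parseBatch(repr)
	fmt.Println(seq, kvs, err) // 7 map[a:b] <nil>
}
```

A verification pass like this, run against batches produced by the linked-in RocksDB, would catch a format change early.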

I updated the RFC to reflect that this is the suggested implementation.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
While the RocksDB WriteBatch format might certainly be enhanced in the future, any enhancements are likely to be backward compatible as a non-compatible change would require rewriting the entries in the WAL. I think writing our own parser and verification of this format would be good, though it shouldn't be a blocker/prerequisite for the rest of the work in this RFC.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions.


docs/RFCS/leader_evaluated_raft.md, line 219 [r7] (raw file):
There is another nice effect from using the WriteBatch: instead of marshaling and unmarshaling individual keys and values, we'll be passing around a blob. While investigating batch insert performance I added some instrumentation around RaftCommand marshaling and noticed:

raft command marshal:   20002: 6.2ms
raft command unmarshal: 20002: 29.1ms
raft command marshal:   20002: 6.0ms
raft command unmarshal: 20002: 12.4ms
raft command marshal:   20002: 6.0ms
raft command unmarshal: 20002: 42.6ms

The first number (20002) is the number of requests in the BatchRequest and the second is the time it took to marshal/unmarshal. The marshaled data size is almost exactly 1MB.


Comments from Reviewable

@tbg tbg force-pushed the tschottdorf/leader-evaluated-raft branch from 460f9c2 to 0f1027c on April 21, 2016 15:01
@tbg
Member Author

tbg commented Apr 21, 2016

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
I added that to the implementation strategy.


docs/RFCS/leader_evaluated_raft.md, line 219 [r7] (raw file):
Yes, check out the current revision, which points this out as one of the advantages. Good to have these numbers; wove them into the draft.


Comments from Reviewable

@bdarnell
Contributor

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
We can trust rocksdb to be backwards-compatible in any changes they make, but we require bidirectional compatibility. We create a batch on the leader and ship it off to the follower to be applied, and the follower might be running either an older or a newer version of the code. If they add a new record type, we either need to be able to guarantee that the new record types won't be used in our case, or translate them back to something the old version can understand.


docs/RFCS/leader_evaluated_raft.md, line 219 [r7] (raw file):
Unmarshaling is much slower than marshaling because it copies each bytes field into a separately allocated array, while marshaling computes the total size needed and does one big allocation. It might be worth exploring a non-copying option for gogoproto that simply sets the bytes field to a slice from the input data.
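The copy-vs-alias trade-off can be illustrated with a toy decoder; `decodeBytesCopy` and `decodeBytesAlias` are hypothetical names for illustration, not gogoproto APIs:

```go
package main

import "fmt"

// decodeBytesCopy mimics the default behavior: allocate a fresh
// array for each bytes field and copy into it.
func decodeBytesCopy(buf []byte, off, n int) []byte {
	out := make([]byte, n)
	copy(out, buf[off:off+n])
	return out
}

// decodeBytesAlias is the zero-copy alternative: set the field to a
// subslice of the input. Safe only if the input buffer outlives the
// message and is never reused or mutated.
func decodeBytesAlias(buf []byte, off, n int) []byte {
	return buf[off : off+n : off+n]
}

func main() {
	wire := []byte("hello-world")
	a := decodeBytesCopy(wire, 0, 5)
	b := decodeBytesAlias(wire, 0, 5)
	wire[0] = 'H' // later mutation of the input buffer
	fmt.Println(string(a), string(b)) // hello Hello: the alias sees it
}
```

The aliasing variant trades one allocation per bytes field for a lifetime constraint on the input buffer, which is the hazard such an option would have to document.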


Comments from Reviewable

@tbg
Member Author

tbg commented Apr 21, 2016

Seems that there's a good consensus so far - @spencerkimball and anyone else interested, I'll keep this open until Monday night ET (and longer if there's discussion). PTAL.

Once merged, I'll create a roadmap (issues for each point outlined in the implementation strategy below and the remainder, and perhaps a label), but we should also determine by when we're hoping to get this done and who will work on it.


Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


docs/RFCS/leader_evaluated_raft.md, line 361 [r4] (raw file):
Added.


docs/RFCS/leader_evaluated_raft.md, line 219 [r7] (raw file):
Good idea, but doesn't fit here - is there any issue to add this to, or could you file it otherwise?


Comments from Reviewable

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


docs/RFCS/leader_evaluated_raft.md, line 219 [r7] (raw file):
#6208


Comments from Reviewable

@JackKrupansky

JackKrupansky commented Apr 21, 2016

Is the presumption here that an upgrade/migration of a node must be performed while a transaction is in progress?

I mean, why not use the following model:

  1. Mark the node as frozen. No new transactions involving ranges replicated to the node can begin.
  2. Let current transactions run to completion.
  3. If step 2 takes too long, have the option of aborting any remaining transactions. No new transactions would be started since the node was frozen.
  4. Upgrade/migrate the node, restart.
  5. Unfreeze the node.

BTW, with Cassandra, if the application is reading and writing with quorum consistency, any single node could be brought down without breaking quorum, assuring absolute non-stop operation. The node would then catch up due to hinted hand-off, provided that the down time was small, under an hour or whatever the handoff timeout grace period is.

It would be nice if there was an application return code indicating that the server is in maintenance mode and that the application should hold off new transactions for some interval of time, rather than the application simply getting some error and mindlessly retrying the operation or assuming that the server is in a bad state rather than simply being maintained.

@tbg
Member Author

tbg commented Apr 22, 2016

No, this happens below the level of transactions. The primary source of complication is that we keep byte-wise consistent replicas (a problem Cassandra does not have) and that the post-raft evaluation of commands makes the bulk of code live downstream of the point at which byte-wise output must be identical.
The goal for (most) migrations is a rolling cluster restart, which (with appropriately configured clients) is completely interruption-free.


Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


Comments from Reviewable

tbg added 5 commits April 25, 2016 07:40
This is a hastily written, initial draft released early to get early feedback
for more refined iterations.

[ci skip]
Misc edits and flesh out

- remaining post-Raft code
- replay protection
- migration strategy
- drawbacks
- unresolved questions
Flesh out

- use of RocksDB WriteBatch for zero-copy serialization
- concretize implementation strategy
- updated discussion on removal of write-write blocking

[ci skip]

* any key-value writes: For example the outcome of a `Put`, `DeleteRange`, `CPut`, `BeginTransaction`, or `LeaderLease`.
Some alternatives exist: writes could be encoded as `[]MVCCKeyValue`, or even
as post-MVCC raw (i.e. opaque) key-value pairs. We'll assume the former for
Member

@spencerkimball spencerkimball Apr 27, 2016


Why assume []MVCCKeyValue for the "effects" contents? Then we're tying up potential for migration in how we encode MVCC information. I think it's more prudent to send "post-MVCC raw" key-value pairs.

@tbg
Member Author

tbg commented Apr 28, 2016

Any follow-up discussions are best held in the appropriate subissue: https://github.com/cockroachdb/cockroach/labels/leader-proposed-raft


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/leader_evaluated_raft.md, line 71 [r17] (raw file):
The above is an option; the later discussion settles on the WriteBatch (which is a post-MVCC raw option).


docs/RFCS/leader_evaluated_raft.md, line 98 [r17] (raw file):
We have way more side effects, and the triggers that we have make nontrivial use of data which may no longer be available (for example, you won't know the txn from the WriteBatch). I don't think we can really see what we need a priori; we've got to do the refactor first. I'm sure we'll see new protos here.


docs/RFCS/leader_evaluated_raft.md, line 175 [r17] (raw file):
Note to self: fix this typo when adjusting the status of this RFC.


docs/RFCS/leader_evaluated_raft.md, line 177 [r17] (raw file):
Good point, made a comment on the issue.


docs/RFCS/leader_evaluated_raft.md, line 274 [r17] (raw file):
Added to #6287.


Comments from Reviewable

@vivekmenezes
Contributor

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/leader_evaluated_raft.md, line 35 [r17] (raw file):
I think this is a big deal for folks who are running large clusters, where a change like this can save thousands of machines.


Comments from Reviewable

@tbg tbg changed the title from [RFC] Leader-evaluated Raft to [RFC] Proposer-evaluated KV on Oct 25, 2016
tbg added a commit to tbg/cockroach that referenced this pull request Oct 25, 2016
Remove the `storagebase.RaftCommand` proto by moving its fields onto
`ReplicatedProposalData` while preserving the tag numbers. Use that message
instead of `RaftCommand` throughout, including on the wire (since the tag
numbers were preserved, this does not require any special handling).

This in preparation for a follow-up change which adds an experimental switch to
use proposer-evaluated KV (cockroachdb#6166).
tbg added a commit to tbg/cockroach that referenced this pull request Oct 26, 2016
Remove the `storagebase.RaftCommand` proto by moving its fields onto
`ReplicatedProposalData` while preserving the tag numbers. Use that message
instead of `RaftCommand` throughout, including on the wire (since the tag
numbers were preserved, this does not require any special handling).

This in preparation for a follow-up change which adds an experimental switch to
use proposer-evaluated KV (cockroachdb#6166).
tbg added a commit to tbg/cockroach that referenced this pull request Oct 31, 2016
Remove the `storagebase.RaftCommand` proto by moving its fields onto
`ReplicatedProposalData` while preserving the tag numbers. Use that message
instead of `RaftCommand` throughout, including on the wire (since the tag
numbers were preserved, this does not require any special handling).

This in preparation for a follow-up change which adds an experimental switch to
use proposer-evaluated KV (cockroachdb#6166).
tbg added a commit to tbg/cockroach that referenced this pull request Oct 31, 2016
Remove the `storagebase.RaftCommand` proto by moving its fields onto
`ReplicatedProposalData` while preserving the tag numbers. Use that message
instead of `RaftCommand` throughout, including on the wire (since the tag
numbers were preserved, this does not require any special handling).

This in preparation for a follow-up change which adds an experimental switch to
use proposer-evaluated KV (cockroachdb#6166).
