storage: prevent command reproposal with new lease index after application #39064
craig[bot] merged 2 commits into cockroachdb:master from
Conversation
2328dfc to
01757dc
Compare
ajwerner
left a comment
There was a problem hiding this comment.
just suggestions on comments from me. Good catch and nice fix.
Reviewed 3 of 3 files at r1, 5 of 5 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/replica_application.go, line 388 at r1 (raw file):
// Provide the command's corresponding logical operations to the Replica's
// rangefeed. Only do so if the WriteBatch is non-nil, otherwise it's valid
// for the logical op log to be nil, which would shut down all rangefeeds.
nit: I get that this is moved code but the second sentence here is a little difficult to understand. Any chance you could add some more words?
pkg/storage/replica_application_result.go, line 319 at r2 (raw file):
var pErr *roachpb.Error
if filter := r.store.cfg.TestingKnobs.TestingPostApplyFilter; filter != nil {
nit: TestingPostApplyFilter might benefit from a comment update that it will only be called on the proposing replica and will have no impact on follower. There's only one user and this PR doesn't change the behavior with regard to what happens to the return values but it does change how often it will be called. It wasn't totally obvious from the old code that setting pErr wouldn't do anything if the command wasn't proposed locally.
pkg/storage/replica_application_result.go, line 353 at r2 (raw file):
// a new one. This is important for pipelined writes, since they
// don't have a client watching to retry, so a failure to
// eventually apply the proposal would be a uservisible error.
nit: user visible or user-visible
tbg
left a comment
There was a problem hiding this comment.
I'd love to hear a little bit about how you repro'ed and zero'ed in on this, perhaps even in the commit message.
Reviewed 3 of 3 files at r1, 5 of 5 files at r2.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replica_application.go, line 391 at r1 (raw file):
// If no rangefeed is running, this call will be a noop.
if cmd.raftCmd.WriteBatch != nil {
  r.handleLogicalOpLogRaftMuLocked(ctx, cmd.raftCmd.LogicalOpLog, batch)
You mention in the commit message that the lookup operates directly on the WriteBatch, but this is still operating on batch. This is probably more efficient than hitting the engine directly if and only if the read is satisfied from what's in the batch. Additionally, reading from the batch may flush some stuff down and be extra expensive? Anyway, reading from the batch is all good because that's the only safe thing to do, but I'm not convinced this improves performance. Unless somehow the fixed overhead of opening a handle into the DB is the main cost here.
If we wanted to go all in we could probably read directly from the WriteBatch first, and fall back to the engine (in approximately 100% of cases). But likely none of that is ever worth it, see if you want to adjust any of the wording though.
pkg/storage/replica_application.go, line 159 at r2 (raw file):
, in which case the proposal was reproposed (either under its original or a new MaxLeaseIndex) which we handle in a second pass below.
pkg/storage/replica_application.go, line 173 at r2 (raw file):
for ok := it.init(&b.cmdBuf); ok; ok = it.next() {
  cmd := it.cur()
  if !cmd.proposedLocally() {
This is not a PR for silly micro-optimizations, but usually you have only a single proposer, so you could populate a bool anyProposedLocally in the previous loop and skip this whole second loop on the followers most of the time.
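The suggested fast path could look something like this toy sketch (the `cmd` type and `processBatch` loop are hypothetical stand-ins for the real cmdAppBatch iteration; only `anyProposedLocally` comes from the comment above):

```go
package main

import "fmt"

type cmd struct {
	proposedLocally bool
	id              int
}

// processBatch records in the first pass whether any command was proposed
// locally, and skips the second pass entirely when none were, which is the
// common case on follower replicas.
func processBatch(cmds []cmd) (visited []int) {
	anyProposedLocally := false
	for _, c := range cmds { // first pass: staging
		if c.proposedLocally {
			anyProposedLocally = true
		}
	}
	if !anyProposedLocally {
		return nil // follower fast path: nothing to reconcile
	}
	for _, c := range cmds { // second pass: only when needed
		if c.proposedLocally {
			visited = append(visited, c.id)
		}
	}
	return visited
}

func main() {
	fmt.Println(processBatch([]cmd{{false, 1}, {true, 2}})) // leaseholder: [2]
	fmt.Println(processBatch([]cmd{{false, 1}, {false, 2}})) // follower: []
}
```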
pkg/storage/replica_application.go, line 179 at r2 (raw file):
reproposed with a higher lease index.
pkg/storage/replica_application.go, line 188 at r2 (raw file):
If tryReproposeWithNewLeaseIndex picks up the proposal on failure, it will re-add the proposal to the proposal map, but this won't affect anything in this cmdAppBatch.
pkg/storage/replica_application_result.go, line 391 at r2 (raw file):
// has already been successfully applied or has been reproposed here or by a
// different entry for the same proposal that hit an illegal lease index error.
func (r *Replica) tryReproposeWithNewLeaseIndex(
Pure movement, correct?
01757dc to
6299f81
Compare
nvb
left a comment
There was a problem hiding this comment.
TFTRs!
I'd love to hear a little bit about how you repro'ed and zero'ed in on this, perhaps even in the commit message.
I added a bit about this to the commit message.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajwerner and @tbg)
pkg/storage/replica_application.go, line 388 at r1 (raw file):
Previously, ajwerner wrote…
nit: I get that this is moved code but the second sentence here is a little difficult to understand. Any chance you could add some more words?
Done.
pkg/storage/replica_application.go, line 391 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You mention in the commit message that the lookup operates directly on the WriteBatch, but this is still operating on batch. This is probably more efficient than hitting the engine directly if and only if the read is satisfied from what's in the batch. Additionally, reading from the batch may flush some stuff down and be extra expensive? Anyway, reading from the batch is all good because that's the only safe thing to do, but I'm not convinced this improves performance. Unless somehow the fixed overhead of opening a handle into the DB is the main cost here.
If we wanted to go all in we could probably read directly from the WriteBatch first, and fall back to the engine (in approximately 100% of cases). But likely none of that is ever worth it, see if you want to adjust any of the wording though.
Clarified. You're right that the read will be satisfied from the batch for MVCCWriteValueOps but not for MVCCCommitIntentOps.
pkg/storage/replica_application.go, line 159 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
, in which case the proposal was reproposed (either under its original or a new MaxLeaseIndex) which we handle in a second pass below.
Done.
pkg/storage/replica_application.go, line 173 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is not a PR for silly micro-optimizations, but usually you have only a single proposer, so you could populate a bool anyProposedLocally in the previous loop and skip this whole second loop on the followers most of the time.
Something like https://github.com/cockroachdb/cockroach/pull/38954/files#diff-5bbfc08e6144696413cb26f2de677022R275? Since I'm planning on adding it there, I might as well make the change now. Done.
pkg/storage/replica_application.go, line 179 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
reproposed with a higher lease index.
Done.
pkg/storage/replica_application.go, line 188 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If tryReproposeWithNewLeaseIndex picks up the proposal on failure, it will re-add the proposal to the proposal map, but this won't affect anything in this cmdAppBatch.
Done.
pkg/storage/replica_application_result.go, line 319 at r2 (raw file):
Previously, ajwerner wrote…
nit: TestingPostApplyFilter might benefit from a comment update that it will only be called on the proposing replica and will have no impact on follower. There's only one user and this PR doesn't change the behavior with regard to what happens to the return values but it does change how often it will be called. It wasn't totally obvious from the old code that setting pErr wouldn't do anything if the command wasn't proposed locally.
Good point. Done.
pkg/storage/replica_application_result.go, line 353 at r2 (raw file):
Previously, ajwerner wrote…
nit: user visible or user-visible
Done.
pkg/storage/replica_application_result.go, line 391 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Pure movement, correct?
No, this is completely rewritten to be a lot cleaner, to no longer need the replica mutex, and to properly handle errors. This was all made possible by the `cmd.raftCmd.MaxLeaseIndex != cmd.proposal.command.MaxLeaseIndex` check in retrieveLocalProposals.
bors r+
Build failed
This might be good.
There are no known issues caused by this, but it seems bad.

509baff introduced batching of Raft entry application. An entire batch of entries is applied and then each of their LogicalOpLogs is consumed if the range has a running rangefeed processor. During this consumption phase, values may be read from the Store's engine (so that they don't need to be duplicated in an entry's WriteBatch and LogicalOpLog). Since we're no longer performing this apply-then-consume cycle one entry at a time, it seems possible for a later entry in a batch to overwrite the value that an earlier entry in the batch wants to read in handleLogicalOpLogRaftMuLocked. This would cause a rangefeed to produce incorrect results.

This commit fixes this issue by consuming logical ops as entries are staged in the batch instead of after the batch is applied. To facilitate this, the lookup in handleLogicalOpLogRaftMuLocked now operates on the WriteBatch directly instead of on the Store's engine. This is likely more efficient when the read is satisfied from what's in the batch (on MVCCWriteValueOp, but likely not on MVCCCommitIntentOp). Either way, it simplifies this logic.

Release note: None
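The ordering bug described above can be modeled with a toy sketch (plain maps and slices standing in for the engine and WriteBatch; none of these names are from the real code):

```go
package main

import "fmt"

type kv struct{ key, val string }

// consumeAfterBatch models the buggy order: every entry in the batch is
// applied to the engine first, and only then does each entry's logical op
// read its key back. A later write to the same key shadows the earlier one.
func consumeAfterBatch(batch []kv) []string {
	engine := map[string]string{}
	for _, e := range batch {
		engine[e.key] = e.val
	}
	var reads []string
	for _, e := range batch {
		reads = append(reads, engine[e.key])
	}
	return reads
}

// consumeDuringStaging models the fix: each entry's op is consumed as the
// entry is staged, before later entries can overwrite its key.
func consumeDuringStaging(batch []kv) []string {
	engine := map[string]string{}
	var reads []string
	for _, e := range batch {
		engine[e.key] = e.val
		reads = append(reads, engine[e.key])
	}
	return reads
}

func main() {
	batch := []kv{{"a", "v1"}, {"a", "v2"}}
	fmt.Println(consumeAfterBatch(batch))    // both ops see v2: wrong for the first
	fmt.Println(consumeDuringStaging(batch)) // first op sees v1, second sees v2
}
```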
storage: prevent command reproposal with new lease index after application

Fixes cockroachdb#39018. Fixes cockroachdb#37810. May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a Raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder, then the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied.

The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have as obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals).
I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added it back to the proposal map. But then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point the refresh would have hit a lease index error and the proposal would be reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals, and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing their duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals under a new max lease index became clear.

Release note: None
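The guard that the fix relies on, refusing to repropose an entry whose proposal has already moved on to a newer MaxLeaseIndex, can be sketched roughly like this (a toy model with invented names, not the actual Replica code):

```go
package main

import "fmt"

// proposal is a stand-in for ProposalData; maxLeaseIndex plays the role of
// ProposalData.command.MaxLeaseIndex.
type proposal struct{ maxLeaseIndex uint64 }

// tryRepropose reproposes a rejected entry under a fresh lease index, but
// only if the entry's MaxLeaseIndex still matches the proposal's. If another
// entry for the same proposal is already in flight under a newer index (or
// has already applied), reproposing again would risk duplicate application.
func tryRepropose(entryMLI uint64, p *proposal, leaseAppliedIndex uint64) (newMLI uint64, ok bool) {
	if entryMLI != p.maxLeaseIndex {
		return 0, false // stale copy of the proposal: do nothing
	}
	p.maxLeaseIndex = leaseAppliedIndex + 1
	return p.maxLeaseIndex, true
}

func main() {
	p := &proposal{maxLeaseIndex: 7}
	fmt.Println(tryRepropose(5, p, 7)) // stale entry is refused
	fmt.Println(tryRepropose(7, p, 7)) // current entry reproposed at index 8
}
```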
6299f81 to
aada4fc
Compare
The assertion I added to detect duplicate command application wasn't ignoring expected errors on the second application attempt. With that added, the test stresses without issue.
bors r+
Build succeeded
The new package provides abstractions and routines associated with the application of committed raft entries to a replicated state machine. This was inspired by four driving forces:

- We've been having a number of discussions on the Core team about making storage abstractions clearer and easier to understand in isolation. One commonly discussed proposal is introducing a `storage/replicate` package that would encapsulate the concerns of raft replication (e.g. log manipulation, snapshots, leader election, heartbeats, etc.). This `storage/apply` package will fit in nicely alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization to acknowledge clients after their raft entries have committed but before they had been applied with the current code structure was moving in the opposite direction and making things even harder to understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has revealed that there exists a high degree of difficulty involved in testing any of the logic in the area of raft entry application. This has naturally led to testing at a distance using tools like testing hooks, which is frustrating and delicate. As a result, we're missing tests for things like old migrations that we still need to support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment) to apply committed raft entries to the Replica storage engine asynchronously in a separate thread than the raft processing thread will make entry application significantly more complex. For instance, we'll likely need to introduce a separate scheduler to coordinate entry application passes across Ranges on a node. The scheduler will likely want to prioritize leaders over followers and embed other policies to optimize for total system throughput. There's a strong desire to isolate this new complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954 in a much cleaner way. This is demonstrated in the next commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR.

Release note: None
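As a rough illustration of the mock-state-machine testing the commit message mentions, a toy version might look like this (all names here are hypothetical, not the actual storage/apply API):

```go
package main

import "fmt"

// Command is a stand-in for a replicated command decoded from a raft entry.
type Command struct {
	Index    uint64
	Rejected bool // successfully replicated, but rejected during application
}

// StateMachine is the kind of narrow interface that lets the apply loop be
// tested without a real Replica or raft.
type StateMachine interface {
	Apply(Command) error
}

// mockSM counts applied commands, standing in for a real Replica.
type mockSM struct{ applied int }

func (m *mockSM) Apply(c Command) error {
	if c.Rejected {
		return fmt.Errorf("command %d rejected below raft", c.Index)
	}
	m.applied++
	return nil
}

// applyCommitted drives a batch of committed entries through the machine,
// mirroring the macro-level steps the package formalizes.
func applyCommitted(sm StateMachine, cmds []Command) (rejections int) {
	for _, c := range cmds {
		if err := sm.Apply(c); err != nil {
			rejections++
		}
	}
	return rejections
}

func main() {
	m := &mockSM{}
	rejections := applyCommitted(m, []Command{{1, false}, {2, true}, {3, false}})
	fmt.Println(m.applied, rejections) // 2 applied, 1 rejection
}
```

Note that a rejected command still consumes its raft entry; rejection is an application-time outcome, which is exactly the distinction the package makes explicit.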
There's a strong desire to isolate this new complexity and to give the logic a place to live. The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions: - What are the stages of raft entry application? - What is the difference between a "raft entry" and a "replicated command"? - What can a command do besides apply its write batch to the storage engine? - What does it mean for a successfully replicated command to be rejected during application? - When can we acknowledge the outcome of a raft proposal? The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all. Finally, the refactor paves the way for making the proposed change in cockroachdb#38954 in a much cleaner way. This is demonstrated in next commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR. Release note: None
39254: storage/apply: create apply package for raft entry application r=nvanbenschoten a=nvanbenschoten

The new package provides abstractions and routines associated with the application of committed raft entries to a replicated state machine. This was inspired by four driving forces:

- We've been having a number of discussions on the Core team about making storage abstractions clearer and easier to understand in isolation. One commonly discussed proposal is introducing a `storage/replicate` package that would encapsulate the concerns of raft replication (e.g. log manipulation, snapshots, leader election, heartbeats, etc.). This `storage/apply` package will fit in nicely alongside a replication abstraction.
- Initial discussion on #38954 concluded that, with the current code structure, adding an optimization to acknowledge clients after their raft entries have committed but before they have been applied was moving in the opposite direction and making things even harder to understand due to the introduction of more complex state management.
- Recent instability in this area (#38976, #39064, #39135, #39203) has revealed that there exists a high degree of difficulty involved in testing any of the logic in the area of raft entry application. This has naturally led to testing at a distance using tools like testing hooks, which is frustrating and delicate. As a result, we're missing tests for things like old migrations that we still need to support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in #17500 (comment) to apply committed raft entries to the Replica storage engine asynchronously, on a separate thread from the raft processing thread, will make entry application significantly more complex. For instance, we'll likely need to introduce a separate scheduler to coordinate entry application passes across Ranges on a node. The scheduler will likely want to prioritize leaders over followers and embed other policies to optimize for total system throughput.

There's a strong desire to isolate this new complexity and to give the logic a place to live. The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all.

Finally, the refactor paves the way for making the proposed change in #38954 in a much cleaner way. This is demonstrated in the second commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Fixes #39018.
Fixes #37810.
May fix other tests.
This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in `cockroach/pkg/storage/replica_raft.go`, line 1430 at 5cbc4b5).
Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of this bug but with less obvious symptoms when they hit it.
In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch, with only one being linked to its ProposalData struct and considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would eventually have hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added it back to the proposal map. But then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point the refresh would have hit a lease index error and the command would have been reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think it could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.
Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals, and that error handling is properly tested.
My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing their duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.