Skip to content

storage: prevent command reproposal with new lease index after application#39064

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/cdcFix
Jul 24, 2019
Merged

storage: prevent command reproposal with new lease index after application#39064
craig[bot] merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/cdcFix

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Jul 24, 2019

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in

// it again. This ensures that at any time, there is only up to a
) If these two entries happened to land in the same application batch on the leaseholder then the first entry would be rejected and the second would apply. The replicas LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (who must have hit a proposalIllegalLeaseIndex), called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have as obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point this refresh would have hit a lease index error and would be reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 cdc/ledger/rangefeed=true roachtests after reducing its duration down to 5m. Usually, at least one of these tests would hit the negative refcount assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

@nvb nvb requested review from a team, ajwerner and tbg July 24, 2019 01:50
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/cdcFix branch from 2328dfc to 01757dc Compare July 24, 2019 02:11
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: just suggestions on comments from me. Good catch and nice fix.

Reviewed 3 of 3 files at r1, 5 of 5 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/replica_application.go, line 388 at r1 (raw file):

	// Provide the command's corresponding logical operations to the Replica's
	// rangefeed. Only do so if the WriteBatch is non-nil, otherwise it's valid
	// for the logical op log to be nil, which would shut down all rangefeeds.

nit: I get that this is moved code but the second sentence here is a little difficult to understand. Any chance you could add some more words?


pkg/storage/replica_application_result.go, line 319 at r2 (raw file):

	var pErr *roachpb.Error
	if filter := r.store.cfg.TestingKnobs.TestingPostApplyFilter; filter != nil {

nit: TestingPostApplyFilter might benefit from a comment update that it will only be called on the proposing replica and will have no impact on follower. There's only one user and this PR doesn't change the behavior with regard to what happens to the return values but it does change how often it will be called. It wasn't totally obvious from the old code that setting pErr wouldn't do anything if the command wasn't proposed locally.


pkg/storage/replica_application_result.go, line 353 at r2 (raw file):

			// a new one. This is important for pipelined writes, since they
			// don't have a client watching to retry, so a failure to
			// eventually apply the proposal would be a uservisible error.

nit: user visible or user-visible

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

I'd love to hear a little bit about how you repro'ed and zero'ed in on this, perhaps even in the commit message.

Reviewed 3 of 3 files at r1, 5 of 5 files at r2.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica_application.go, line 391 at r1 (raw file):

	// If no rangefeed is running, this call will be a noop.
	if cmd.raftCmd.WriteBatch != nil {
		r.handleLogicalOpLogRaftMuLocked(ctx, cmd.raftCmd.LogicalOpLog, batch)

You mention in the commit message that the lookup operates directly on the WriteBatch, but this is still operating on batch. This is probably more efficient than hitting the engine directly if and only if the read is satisfied from what's in the batch. Additionally, reading from the bash may flush some stuff down and be extra expensive? Anyway, reading from the batch is all good because that's the only safe thing to do, but I'm not convinced this improves performance. Unless somehow the fixed overhead of opening a handle into the DB is the main cost here.
If we wanted to go all in we could probably read directly from the WriteBatch first, and fall back to the engine (in approximately 100% of cases). But likely none of that is ever worth it, see if you want to adjust any of the wording though.


pkg/storage/replica_application.go, line 159 at r2 (raw file):

, in which case the proposal was reproposed (either under its original or a new MaxLeaseIndex) which we handle in a second pass below.


pkg/storage/replica_application.go, line 173 at r2 (raw file):

	for ok := it.init(&b.cmdBuf); ok; ok = it.next() {
		cmd := it.cur()
		if !cmd.proposedLocally() {

This is not a PR for silly micro-optimizations, but usually you have only a single proposer, so you could populate a bool anyProposedLocally in the previous loop and skip this whole second loop on the followers most of the time.


pkg/storage/replica_application.go, line 179 at r2 (raw file):

reproposed with a higher lease index.


pkg/storage/replica_application.go, line 188 at r2 (raw file):

If tryReproposeWithNewLeaseIndex picks up the proposal on failure, it will re-add the proposal to the proposal map, but this won't affect anything in this cmdAppBatch.


pkg/storage/replica_application_result.go, line 391 at r2 (raw file):

// has already been successfully applied or has been reproposed here or by a
// different entry for the same proposal that hit an illegal lease index error.
func (r *Replica) tryReproposeWithNewLeaseIndex(

Pure movement, correct?

@nvb nvb force-pushed the nvanbenschoten/cdcFix branch from 01757dc to 6299f81 Compare July 24, 2019 13:14
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

I'd love to hear a little bit about how you repro'ed and zero'ed in on this, perhaps even in the commit message.

I added a bit about this to the commit message.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajwerner and @tbg)


pkg/storage/replica_application.go, line 388 at r1 (raw file):

Previously, ajwerner wrote…

nit: I get that this is moved code but the second sentence here is a little difficult to understand. Any chance you could add some more words?

Done.


pkg/storage/replica_application.go, line 391 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You mention in the commit message that the lookup operates directly on the WriteBatch, but this is still operating on batch. This is probably more efficient than hitting the engine directly if and only if the read is satisfied from what's in the batch. Additionally, reading from the bash may flush some stuff down and be extra expensive? Anyway, reading from the batch is all good because that's the only safe thing to do, but I'm not convinced this improves performance. Unless somehow the fixed overhead of opening a handle into the DB is the main cost here.
If we wanted to go all in we could probably read directly from the WriteBatch first, and fall back to the engine (in approximately 100% of cases). But likely none of that is ever worth it, see if you want to adjust any of the wording though.

Clarified. You're right that the read will be satisfied from the batch for MVCCWriteValueOps but not for MVCCCommitIntentOps.


pkg/storage/replica_application.go, line 159 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

, in which case the proposal was reproposed (either under its original or a new MaxLeaseIndex) which we handle in a second pass below.

Done.


pkg/storage/replica_application.go, line 173 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is not a PR for silly micro-optimizations, but usually you have only a single proposer, so you could populate a bool anyProposedLocally in the previous loop and skip this whole second loop on the followers most of the time.

Something like https://github.com/cockroachdb/cockroach/pull/38954/files#diff-5bbfc08e6144696413cb26f2de677022R275? Since I'm planning on adding it there, I might as well make the change now. Done.


pkg/storage/replica_application.go, line 179 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

reproposed with a higher lease index.

Done.


pkg/storage/replica_application.go, line 188 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

If tryReproposeWithNewLeaseIndex picks up the proposal on failure, it will re-add the proposal to the proposal map, but this won't affect anything in this cmdAppBatch.

Done.


pkg/storage/replica_application_result.go, line 319 at r2 (raw file):

Previously, ajwerner wrote…

nit: TestingPostApplyFilter might benefit from a comment update that it will only be called on the proposing replica and will have no impact on follower. There's only one user and this PR doesn't change the behavior with regard to what happens to the return values but it does change how often it will be called. It wasn't totally obvious from the old code that setting pErr wouldn't do anything if the command wasn't proposed locally.

Good point. Done.


pkg/storage/replica_application_result.go, line 353 at r2 (raw file):

Previously, ajwerner wrote…

nit: user visible or user-visible

Done.


pkg/storage/replica_application_result.go, line 391 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Pure movement, correct?

No, this is completely rewritten to be a lot cleaner, to no longer need the replica mutex, and to properly handle errors. This was all made possible by the if cmd.raftCmd.MaxLeaseIndex != cmd.proposal.command.MaxLeaseIndex { check in retrieveLocalProposals.

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Jul 24, 2019

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 24, 2019

Build failed

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Jul 24, 2019

F190724 13:56:46.869487 538738 storage/replica_proposal.go:131  [n1,s1,r16/1:/Table/2{0-1}] proposal already applied: &{ctx:0xc011aabd40 sp:<nil> idKey:N��9�Wi_ proposedAtTicks:5 command:0xc022707900 encodedCommand:[0 78 136 220 57 189 87 105 95 18 6 8 1 16 1 24 1 48 1 106 19 66 10 8 147 148 240 195 128 142 151 218 21 146 1 4 96 116 104 2 114 75 10 73 0 0 0 0 0 0 0 0 1 0 0 0 1 29 1 107 18 156 0 1 113 108 112 116 99 111 110 115 105 115 116 101 110 99 121 67 104 101 99 107 101 114 0 29 18 4 8 0 16 0 24 0 32 0 40 0 50 15 54 176 12 30 3 8 178 152 233 195 128 142 151 218 21 32 13] quotaSize:108 tmpFooter:{MaxLeaseIndex:13} ec:{repl:0xc005727800 lg:0xc00e41fd60} applied:true doneCh:<nil> Local:LocalResult (reply: (err: <nil>), *roachpb.PutResponse, #intents: 0, #endTxns: 0 #updated txns: 0, GossipFirstRange:false MaybeGossipSystemConfig:false MaybeAddToSplitQueue:false MaybeGossipNodeLiveness:<nil> MaybeWatchForMerge:false Request:Put [/Local/Range/Table/20/QueueLastProcessed/"consistencyChecker",/Min)}

This might be good.

nvb added 2 commits July 24, 2019 10:35
There are no known issues caused by this, but it seems bad.

509baff introduced batching of Raft entry application. An entire batch of
entries is applied and then each of their LogicalOpLogs are consumed if the
range has a running rangefeed processor. During this consumption phase, values
may be read from the Store's engine (so that they don't need to be duplicated in
an entries WriteBatch and LogicalOpLog). Since we're no longer performing this
apply-then-consumer cycle one entry at a time, it seems possible for a later
entry a batch to overwrite the value that an earlier entry in the batch wants
to read in handleLogicalOpLogRaftMuLocked. This would cause a rangefeed to
produce incorrect results.

This commit fixes this issue by consuming logical ops as entries are staged
in the batch instead of after the batch is applied. To facilitate this, the
lookup in handleLogicalOpLogRaftMuLocked now operates on the WriteBatch
directly instead of on the Store's engine. This is likely more efficient
when the read is satisfied from what's in the batch (on MVCCWriteValueOp
but likely not on MVCCCommitIntentOp). Either way, it simplifies this logic.

Release note: None
…ation

Fixes cockroachdb#39018.
Fixes cockroachdb#37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft
proposal to be applied multiple times at multiple applied indexes. The
bug was possible if a raft proposal was reproposed twice, once with the
same max lease index and once with a new max lease index. Because there
are two entries for the same proposal, one necessarily has to have an
invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430)
If these two entries happened to land in the same application batch on
the leaseholder then the first entry would be rejected and the second
would apply. The replicas LeaseAppliedIndex would then be bumped all the
way to the max lease index of the entry that applied. The bug occurred when
the first entry (who must have hit a proposalIllegalLeaseIndex), called into
tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be
equal to the Replica's LeaseAppliedIndex because it would match the index
in the successful entry. We would then repropose the proposal with a larger
lease applied index. This new entry could then apply and result in duplicate
entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it
caught this duplicate entry application and panicked loudly. Other tests might
also be failing because of it but might not have as obvious of symptoms when
they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of
fixing an issue where two entries for the same command could be in the same
batch and only one would be linked to its ProposalData struct and be considered
locally proposed (see the change in retrieveLocalProposals). I believe that this
would prevent the command from being properly acknowledged if the first entry
was rejected due to its max lease index and the second entry had a larger max
lease index and applied. I think the first entry would have eventually hit the
check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData
had a new MaxLeaseIndex so it would have added it back to the proposal map, but
then it would have had to wait for refreshProposalsLocked to refresh the
proposal, at which point this refresh would have hit a lease index error and
would be reproposed at a higher index. Not only could this cause duplicate
versions of the same command to apply (described above), but I think this could
even loop forever without acknowledging the client. It seems like there should
be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out
these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should
hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to
properly handle errors during tryReproposeWithNewLeaseIndex reproposals and
that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10
`cdc/ledger/rangefeed=true` roachtests after reducing its duration down to
5m. Usually, at least one of these tests would hit the `negative refcount`
assertion. I then incrementally added more and more logging around entry
application until I painted a full picture of which logical ops were being
consumed by the rangefeed processor and why the same raft command was being
applied twice (once it became clear that one was). After a few more rounds
of fine-tuning the logging, the interaction with reproposals with a new max
lease index became clear.

Release note: None
@nvb nvb force-pushed the nvanbenschoten/cdcFix branch from 6299f81 to aada4fc Compare July 24, 2019 15:06
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Jul 24, 2019

The assertion I added to detect duplicate command application wasn't ignoring expected errors on the second application attempt. With that added, the test stresses without issue.

bors r+

craig bot pushed a commit that referenced this pull request Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018.
Fixes #37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430) If these two entries happened to land in the same application batch on the leaseholder then the first entry would be rejected and the second would apply. The replicas LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied. The bug occurred when the first entry (who must have hit a proposalIllegalLeaseIndex), called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have as obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point this refresh would have hit a lease index error and would be reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing its duration down to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 24, 2019

Build succeeded

@craig craig bot merged commit aada4fc into cockroachdb:master Jul 24, 2019
@nvb nvb deleted the nvanbenschoten/cdcFix branch July 24, 2019 15:47
nvb added a commit to nvb/cockroach that referenced this pull request Aug 2, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for thing
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The schedule will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Aug 5, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for thing
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The schedule will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Aug 6, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for thing
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The schedule will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Aug 6, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for thing
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The schedule will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Aug 6, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for thing
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The schedule will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Aug 7, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for thing
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The schedule will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 7, 2019
39254: storage/apply: create apply package for raft entry application r=nvanbenschoten a=nvanbenschoten

The new package provides abstractions and routines associated with the application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about making storage abstractions more clear and easier to understand in isolation. One commonly discussed proposal is introducing a `storage/replicate` package that would encapsulate the concerns of raft replication (e.g. log manipulation, snapshots, leader election, heartbeats, etc.). This `storage/apply` package will fit in nicely alongside a replication abstraction.
- Initial discussion on #38954 concluded that adding an optimization to acknowledge clients after their raft entries have committed but before they had been applied with the current code structure was moving in the opposite direction and making things even harder to understand due to the introduction of more complex state management.
- Recent instability in this area (#38976, #39064, #39135, #39203) has revealed that there exists a high degree of difficulty involved in testing any of the logic in the area of raft entry application. This has naturally led to testing at a distance using tools like testing hooks, which is frustrating and delicate. As a result, we're missing tests for things like old migrations that we still need to support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in #17500 (comment) to apply committed raft entries to the Replica storage engine asynchronously in a separate thread than the raft processing thread will make entry application significantly more complex. For instance, we'll likely need to introduce a separate scheduler to coordinate entry application passes across Ranges on a node. The schedule will likely want to prioritize leaders over followers and embed other policies to optimize for total system throughput. There's a strong desire to isolate this new complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all.

Finally, the refactor paves the way for making the proposed change in #38954 in a much cleaner way. This is demonstrated in the second commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Aug 15, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for thing
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread than the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The schedule will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: cdc/ledger/rangefeed=true failed roachtest: cdc/tpcc-1000/rangefeed=true failed

4 participants