storage: initialize the proposalQuotaBaseIndex from Applied#39135

Merged
craig[bot] merged 1 commit into cockroachdb:master from ajwerner:ajwerner/fix-quotapool-leak
Jul 30, 2019

Conversation

@ajwerner
Contributor

This commit changes the initialization of proposalQuotaBaseIndex from
lastIndex, which may include entries that are not yet committed, to
status.Commit, the highest committed index. Given that the
proposalQuotaBaseIndex should account for all committed proposals whose quota
has been released, and that proposals add their quota to the release queue
after they have been committed, it's important that the base index not be too
high lest we leave quota in the queue.

This commit also adds an assertion that the proposalQuotaBaseIndex plus the
length of the queue does not exceed the current committed index.

See #39022 (comment)
for more details.

With this change, the assertion was not hit in 10 runs of a TPC-C import,
whereas without it the assertion was hit roughly 30% of the time.

Fixes #39022.

Release note (bug fix): Properly initialize proposal quota tracking to prevent
quota leak which can hang imports or other AddSSTable operations.
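
To make the invariant concrete, here is a minimal Go sketch of the check described above; the helper and its parameter names are illustrative, not the actual Replica code:

```go
package main

import "fmt"

// checkQuotaInvariant illustrates the assertion added by this PR: the base
// index plus the number of quota entries still sitting in the release queue
// must not exceed the committed index; if it does, quota has been stranded
// in the queue.
func checkQuotaInvariant(baseIndex uint64, queueLen int, committed uint64) error {
	if covered := baseIndex + uint64(queueLen); covered > committed {
		return fmt.Errorf(
			"proposalQuotaBaseIndex (%d) + release queue length (%d) = %d exceeds committed index %d",
			baseIndex, queueLen, covered, committed)
	}
	return nil
}

func main() {
	// Base index 10 with 3 queued releases is consistent at commit index 13,
	// but would trip the assertion at commit index 12.
	fmt.Println(checkQuotaInvariant(10, 3, 13)) // <nil>
	fmt.Println(checkQuotaInvariant(10, 3, 12)) // error
}
```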

@ajwerner ajwerner requested a review from a team July 27, 2019 01:23
@cockroach-teamcity
Member

This change is Reviewable

@ajwerner
Contributor Author

This would benefit from a unit test, but I wanted to get it up first.

@ajwerner ajwerner force-pushed the ajwerner/fix-quotapool-leak branch 2 times, most recently from e9ef4c2 to b970648 Compare July 29, 2019 19:46
@ajwerner ajwerner changed the title from "storage: initialize the proposalQuotaBaseIndex from Commit" to "storage: initialize the proposalQuotaBaseIndex from Applied" Jul 30, 2019
@ajwerner ajwerner force-pushed the ajwerner/fix-quotapool-leak branch from b970648 to 43949c6 Compare July 30, 2019 02:54
@ajwerner
Contributor Author

I've spent a few hours trying to come up with a unit test that reproduces the original failure on master and had a bad time. I've tried things that partition replicas, and in general I'm finding that there are enough log entries when the partition is healed that the min index far outstrips the couple of queued entries which are not accounted for by the proposalQuotaBaseIndex being beyond the committed index. I also tried just tossing the lease around and proposing entries concurrently. I'm starting to think I need to filter specific raft messages, but am also questioning the point. If anybody has testing advice, I'm all ears.

@ajwerner ajwerner requested a review from nvb July 30, 2019 17:26
@ajwerner
Contributor Author

@nvanbenschoten I'll keep banging on a regression test but in the meantime want to give this a look?

Contributor

@nvb nvb left a comment


Reviewed 3 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/helpers_test.go, line 270 at r1 (raw file):

	defer r.mu.Unlock()
	var appliedIndex uint64
	_ = r.withRaftGroupLocked(false, func(r *raft.RawNode) (unquiesceAndWakeLeader bool, err error) {

Let's do something with this error. Panicking is better than nothing.


pkg/storage/replica_application.go, line 181 at r1 (raw file):

	for ok := it.init(&b.cmdBuf); ok; ok = it.next() {
		cmd := it.cur()
		var toRelease int64

nit: I think it's generally considered good style to explicitly initialize zero values during variable declaration if they can be used without being overridden. s/var toRelease int64/toRelease := int64(0)/


pkg/storage/replica_application.go, line 182 at r1 (raw file):

		cmd := it.cur()
		var toRelease int64
		shouldRemove := anyLocal && cmd.proposedLocally() &&

nit: proposedLocally() => anyLocal, so I would only include the redundancy if calling proposedLocally() was expensive, but it's not.


pkg/storage/replica_proposal_quota.go, line 109 at r1 (raw file):

		if r.mu.replicaID == r.mu.leaderID {
			// We're becoming the leader.
			// Initialize the proposalQuotaBaseIndex as the current committed index.

Might as well take this time to explain why this is the case and how it relates to the assertion we make below. Something along the lines of "after the proposal quota is enabled, all entries applied by this replica will be added to the quotaReleaseQueue".
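
For illustration, a self-contained sketch of the initialization being discussed, with simplified stand-in types (the real fields live on Replica.mu; at this revision the index used was the committed index, and later revisions of the PR switched to the applied index):

```go
package quotasketch // hypothetical package, for illustration only

// quotaState is a pared-down stand-in for the relevant Replica.mu fields.
type quotaState struct {
	proposalQuotaBaseIndex uint64
	quotaReleaseQueue      []int64
}

// onBecomeLeader initializes the base index when this replica becomes the
// leader. From this point on, every entry the replica applies enqueues a
// (possibly zero-valued) release, so proposalQuotaBaseIndex plus
// len(quotaReleaseQueue) tracks the index up to which quota is accounted for.
func (q *quotaState) onBecomeLeader(initialIndex uint64) {
	q.proposalQuotaBaseIndex = initialIndex
	q.quotaReleaseQueue = nil
}
```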


pkg/storage/replica_proposal_quota.go, line 224 at r1 (raw file):

		// quotaReleaseQueue when entries 'come out' of Raft via
		// raft.Ready.CommittedEntries. The minIndex computed above uses the
		// replica's commit index which is independent of whether or we've

"whether or not"


pkg/storage/replica_proposal_quota.go, line 231 at r1 (raw file):

		// quota releases.
		numReleases := minIndex - r.mu.proposalQuotaBaseIndex
		if qLen := uint64(len(r.mu.quotaReleaseQueue)); qLen < numReleases {

If we initialize minIndex to status.Applied instead of status.Commit, can we get rid of this case entirely?

This commit changes the initialization of `proposalQuotaBaseIndex` from
`lastIndex`, which may include entries that are not yet committed, to
`status.Applied`, the highest applied index. Given that the
`proposalQuotaBaseIndex` should account for all committed proposals whose quota
has been released, and that proposals add their quota to the release queue
after they have been committed, it's important that the base index not be too
high lest we leave quota in the queue.

This commit also adds an assertion that the `proposalQuotaBaseIndex` plus the
length of the queue is exactly equal to the applied index. In order to maintain
this invariant, the commit ensures that we enqueue a zero-value release to the
release queue for empty commands and commands proposed on another node.

See cockroachdb#39022 (comment)
for more details.

Fixes cockroachdb#39022.

Release note (bug fix): Properly initialize proposal quota tracking to prevent
quota leak which can hang imports or other AddSSTable operations.
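
A rough illustration of the zero-value enqueueing described in the commit message (hypothetical helper; the names only mirror the discussion above):

```go
package quotasketch // hypothetical package, for illustration only

// enqueueRelease appends a quota release for every applied command. Empty
// commands and commands proposed on another node return no local quota, so
// a zero-valued entry keeps the queue length in lockstep with the number of
// entries applied past proposalQuotaBaseIndex.
func enqueueRelease(queue []int64, proposedLocally bool, quota int64) []int64 {
	toRelease := int64(0) // explicit zero value, per the style nit above
	if proposedLocally {
		toRelease = quota
	}
	return append(queue, toRelease)
}
```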
@ajwerner ajwerner force-pushed the ajwerner/fix-quotapool-leak branch from 43949c6 to 0af5e44 Compare July 30, 2019 19:49
Contributor Author

@ajwerner ajwerner left a comment


Thanks for the review, ready for another look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/helpers_test.go, line 270 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's do something with this error. Panicking is better than nothing.

Sure, done.


pkg/storage/replica_application.go, line 181 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I think it's generally considered good style to explicitly initialize zero values during variable declaration if they can be used without being overridden. s/var toRelease int64/toRelease := int64(0)/

Done.


pkg/storage/replica_proposal_quota.go, line 231 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we initialize minIndex to status.Applied instead of status.Commit, can we get rid of this case entirely?

Indeed, and the exposition that came with it.
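
A hedged sketch of the simplified release path after this change (assumed names; the real logic lives in pkg/storage/replica_proposal_quota.go). Once minIndex is derived from applied indexes, the queue always holds at least minIndex - baseIndex entries, so the underflow branch quoted above can be dropped:

```go
package quotasketch // hypothetical package, for illustration only

// releaseUpTo pops and releases queued quota for every index in
// (baseIndex, minIndex]. It relies on the invariant above: the queue holds
// one entry per applied index past baseIndex, so len(queue) is at least
// minIndex-baseIndex and no bounds check is needed.
func releaseUpTo(baseIndex uint64, queue []int64, minIndex uint64,
	release func(int64)) (newBase uint64, newQueue []int64) {
	numReleases := minIndex - baseIndex
	for _, toRelease := range queue[:numReleases] {
		release(toRelease)
	}
	return minIndex, queue[numReleases:]
}
```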

Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 4 of 4 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@ajwerner
Contributor Author

TFTR!

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Jul 30, 2019

Build failed

@ajwerner
Contributor Author

[Lint] ccache: error: /home/roach/.ccache/ccache.conf: No such file or directory

Chalking it up as a build flake.

bors r=nvanbenschoten

craig bot pushed a commit that referenced this pull request Jul 30, 2019
39135: storage: initialize the proposalQuotaBaseIndex from Applied r=nvanbenschoten a=ajwerner

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig
Contributor

craig bot commented Jul 30, 2019

Build succeeded

@craig craig bot merged commit 0af5e44 into cockroachdb:master Jul 30, 2019
nvb added a commit to nvb/cockroach that referenced this pull request Aug 2, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they had been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for things
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a separate thread from the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The scheduler will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in the next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Aug 5, 2019
util/quotapool is a generalization and optimization of the excised quotaPool.
Its somewhat different interface requires some code changes, specifically
around synchronization of modifications to the `ProposalData` struct.
Fortunately these changes combined with the logic changes in cockroachdb#39135
leave us with proposal quota tracking which never double-frees and exactly
tracks outstanding proposals.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 7, 2019
39254: storage/apply: create apply package for raft entry application r=nvanbenschoten a=nvanbenschoten

The new package provides abstractions and routines associated with the application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about making storage abstractions more clear and easier to understand in isolation. One commonly discussed proposal is introducing a `storage/replicate` package that would encapsulate the concerns of raft replication (e.g. log manipulation, snapshots, leader election, heartbeats, etc.). This `storage/apply` package will fit in nicely alongside a replication abstraction.
- Initial discussion on #38954 concluded that adding an optimization to acknowledge clients after their raft entries have committed but before they had been applied with the current code structure was moving in the opposite direction and making things even harder to understand due to the introduction of more complex state management.
- Recent instability in this area (#38976, #39064, #39135, #39203) has revealed that there exists a high degree of difficulty involved in testing any of the logic in the area of raft entry application. This has naturally led to testing at a distance using tools like testing hooks, which is frustrating and delicate. As a result, we're missing tests for things like old migrations that we still need to support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in #17500 (comment) to apply committed raft entries to the Replica storage engine asynchronously in a separate thread from the raft processing thread will make entry application significantly more complex. For instance, we'll likely need to introduce a separate scheduler to coordinate entry application passes across Ranges on a node. The scheduler will likely want to prioritize leaders over followers and embed other policies to optimize for total system throughput. There's a strong desire to isolate this new complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all.

Finally, the refactor paves the way for making the proposed change in #38954 in a much cleaner way. This is demonstrated in the second commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Aug 8, 2019
util/quotapool is a generalization and optimization of the excised quotaPool.
Its somewhat different interface requires some code changes, specifically
around synchronization of modifications to the `ProposalData` struct.
Fortunately these changes combined with the logic changes in cockroachdb#39135
leave us with proposal quota tracking which never double-frees and exactly
tracks outstanding proposals (modulo proposals which commit but are
re-proposed).

Release note: None