[WIP] storage: proposal quota pool #13869
petermattis wants to merge 2 commits into cockroachdb:master
Conversation
@spencerkimball Per our discussion yesterday. This isn't fully fleshed out right now, but gives us something concrete to argue over.
dd3f1cf to 6e031a0
This seems to work in simple testing. A node crash will cause a 2 second blip of unavailability if the range is very active. In normal operation, we'll limit proposals to the speed of the slowest replica, but with a buffer to avoid imposing the latency of the slower replicas on the commit latency. Cc @bdarnell
spencerkimball
left a comment
Overall this is pretty simple, which is positive. As mentioned in the review comments, I think the biggest concern would be a replica coming back online and wreaking havoc.
My earlier thought to avoid the 2s timeout seems less useful now that I'm looking at this. It would just push the problem I was looking to solve elsewhere – we'd be relying on node liveness instead of this 2s timeout.
Need unittests too, obviously.
pkg/storage/quota_pool.go (outdated)

    )

    const (
        leaderProposalQuota = 1000
Do we need something more thoughtful here?
Probably. We'd like something that avoids Raft log truncation. This achieves that for the workload I'm using, which involves small requests. More directly estimating the effect on Raft log size would be nice, though I'm not sure how to achieve that.
pkg/storage/quota_pool.go (outdated)

    type quotaPool struct {
        c chan int64

        mu sync.Mutex
Any reason not to use atomic primitives?
This code was lifted from grpc. I haven't thought through the difficulties with using atomics here.
Yep, could be tricky. You'll have to scrutinize the code carefully.
pkg/storage/quota_pool.go (outdated)

    }

    func newQuotaPool(q int64) *quotaPool {
        qb := &quotaPool{
Cause I stole it from the grpc code base which uses the name qb. I'll change it eventually.
pkg/storage/replica.go (outdated)

    r.mu.proposalQuotaBase = r.mu.lastIndex
    r.proposalQuota.reset(leaderProposalQuota)
    } else {
        r.proposalQuota.reset(followerProposalQuota)
Why do followers require a proposal quota?
So they can propose requests which get forwarded to the leader. For example, followers propose lease requests, which is how they determine who holds the lease. We could probably change this so that we don't acquire proposal quota when a follower proposes a Raft command.
I'd just as soon eliminate the follower's use of a quota pool and the associated constant: set the pool to nil on a leader -> follower transition, and allow followers to always propose and forward to the leader.
pkg/storage/replica.go (outdated)

    // Computed checksum at a snapshot UUID.
    checksums map[uuid.UUID]replicaChecksum

    proposalQuotaBase uint64
s/proposalQuotaBase/proposalQuotaBaseIndex/
pkg/storage/store.go (outdated)

    if req.Message.Type == raftpb.MsgApp {
        r.setEstimatedCommitIndexLocked(req.Message.Commit)
    }
    r.setLastActivityLocked(req.FromReplica.ReplicaID)
Is it possible for a hopelessly lagged replica to come back online and require a snapshot, but still report its log index and cause the quota pool to be immediately drained to a negative value?
No, that's not possible. The purpose of Replica.mu.proposalQuotaBaseIndex is that the index used for quota calculations only ratchets up. If a replica comes back online with a commit index before the quota base index, it will be ignored for quota purposes until it catches back up.
Actually, this was broken before, but should be fixed now.
pkg/storage/replica.go (outdated)

    for _, rep := range r.mu.state.Desc.Replicas {
        // Only consider followers that we've received a message from in the last 2
        // seconds.
        const activeTime = 2 * time.Second
Probably need to put this in StoreConfig.
Yep. I'll add a TODO for now.
petermattis
left a comment
Yes, unit tests are definitely needed before merging.
Aside from unittests, LGTM.
ec6f53a to e5c6396
I reworked the handling of
Review status: 0 of 3 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.

pkg/storage/replica.go, line 2802 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Done.

Comments from Reviewable
Review status: 0 of 3 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.

pkg/storage/quota_pool.go, line 18 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Reworked to remove the mutex.

pkg/storage/quota_pool.go, line 23 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.

Comments from Reviewable
7044091 to fd1947d
Review status: 0 of 3 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.

pkg/storage/quota_pool.go, line 18 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Doh! Removing the mutex clearly wasn't safe. Reintroduced for now.

Comments from Reviewable
fd1947d to 813d7bc
Reviewed 1 of 3 files at r1, 2 of 2 files at r2.

pkg/storage/quota_pool.go, line 75 at r2 (raw file):
Is

pkg/storage/quota_pool.go, line 101 at r2 (raw file):
Document what is required of the caller: if acquire() returns without error, the caller must call add(1) after doing its work.

pkg/storage/replica.go, line 825 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
We have a lot of different failure detectors, with different thresholds and even different mechanisms. Can this piggyback on another (probably NodeLiveness) instead of introducing a new way for a node to be "down"? A failing node already triggers a blip of up to 9s (?) for ranges it leads, so it doesn't seem bad to me to lengthen the impact of a failing node on ranges where it is a follower that are also maxing out their proposal quota.

pkg/storage/replica.go, line 812 at r2 (raw file):
We should also clear from this map at some point - at least when removing a replica, and maybe also empty the whole map when leadership changes hands?

pkg/storage/replica.go, line 825 at r2 (raw file):
s/refresh/update/?

pkg/storage/replica.go, line 840 at r2 (raw file):
Add "still" to this comment to distinguish it from the early return above when we're becoming the leader.

pkg/storage/replica.go, line 842 at r2 (raw file):
I assume your concern is for the allocations this performs? We could probably add a method to query this without additional allocations. We could also maintain our own map by tracking MsgAppResps as they pass through the transport, but that probably duplicates too much raft logic.

pkg/storage/replica.go, line 844 at r2 (raw file):
s/committed/acknowledged/

pkg/storage/replica.go, line 857 at r2 (raw file):

pkg/storage/replica.go, line 870 at r2 (raw file):
Add a comment pointing out this subtlety: Raft may propose commands itself (specifically the empty commands when leadership changes), and these commands don't go through the code paths where we acquire quota from the pool. We avoid releasing quota here that we never acquired by resetting the quota pool whenever leadership changes hands.

Comments from Reviewable
813d7bc to 7369586
I still need to add tests.

Review status: 1 of 3 files reviewed at latest revision, 16 unresolved discussions.

pkg/storage/quota_pool.go, line 75 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
We always pass in a positive value to

pkg/storage/quota_pool.go, line 101 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica.go, line 825 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I initially piggybacked this on node liveness, but then got worried that the node liveness ranges would have to be exempted in order to avoid deadlock. Seemed simpler to have a different mechanism here.

pkg/storage/replica.go, line 812 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. We now only initialize this map on the leader and clear it when a node becomes a follower.

pkg/storage/replica.go, line 825 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica.go, line 840 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica.go, line 842 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yeah, the allocation is the concern. It is a small concern, though. Trying to track this ourselves via MsgAppResp seems like overkill. I've enhanced the TODO comment.

pkg/storage/replica.go, line 844 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica.go, line 857 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica.go, line 870 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

Comments from Reviewable
Review status: 1 of 3 files reviewed at latest revision, 16 unresolved discussions, some commit checks pending.

pkg/storage/replica.go, line 825 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Rather than introducing a new configurable here, perhaps this time should be the same as the Raft election timeout. The only way for a leader to not have heard from a Replica for longer than the Raft election timeout is for the Range to be quiescent, in which case there are no concerns about proposal quota.

Comments from Reviewable
Review status: 1 of 3 files reviewed at latest revision, 16 unresolved discussions, some commit checks pending.

pkg/storage/replica.go, line 825 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Err,

Comments from Reviewable
Reviewed 2 of 2 files at r3.

pkg/storage/replica.go, line 825 at r1 (raw file):
OK, what about RPC heartbeats then? I'd really like to avoid introducing a new notion of node activity, especially one with subtle interactions with quiescence and leadership changes.

pkg/storage/replica.go, line 812 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
So a brand-new leader will see all the followers as inactive. That seems less than ideal, although since we're also resetting the quota to its maximum limit when becoming leader I don't see a specific problem with it.

Comments from Reviewable
7369586 to ddc6ef8
Review status: 1 of 2 files reviewed at latest revision, 8 unresolved discussions.

pkg/storage/replica.go, line 825 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
My instinct is that the new mechanism would be less fragile than reusing an existing one like RPC heartbeats. Not sure why I feel that way. I'll think about this more. That said, I added another commit which replaces

Comments from Reviewable
The leader maintains a pool of "proposal quota". Before proposing a Raft command, we acquire 1 unit of proposal quota. When all of the active followers have committed an entry, that unit of proposal quota is returned to the pool. The proposal quota pool size is hard coded to 1000, which allows fairly deep pipelining of Raft commands. We only consider "active" followers when determining if a unit of quota should be returned to the pool. An active follower is one we've received any type of message from in the past 2 seconds. See cockroachdb#8659
ddc6ef8 to f16f2db
Cc @irfansharif
Repurposing cockroachdb#13869. The leader maintains a pool of "proposal quota". Before proposing a Raft command, we acquire 1 unit of proposal quota. When all of the healthy followers have committed an entry, that unit of proposal quota is returned to the pool. The proposal quota pool size is hard coded to 1000, which allows fairly deep pipelining of Raft commands. We only consider followers that have "healthy" RPC connections when determining if a unit of quota should be returned to the pool.