kvserver,kvflowcontrol: integrate flow control #98308
craig[bot] merged 12 commits into cockroachdb:master
Conversation
sumeerbhola
left a comment
I did a rough pass. Just some small comments.
Reviewed 2 of 37 files at r26, 6 of 74 files at r34, 3 of 66 files at r35, 1 of 29 files at r37, 7 of 63 files at r38, 12 of 45 files at r39, 4 of 18 files at r41, 1 of 2 files at r42.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @irfansharif, @pavelkalinnikov, @smg260, and @tbg)
-- commits line 106 at r42:
nit: two settings?
pkg/kv/kvserver/flow_control_replica_integration.go line 88 at r42 (raw file):
// tied to the lifetime of a leaseholder replica having raft leadership. We
// don't intercept lease acquisitions/transfers -- simply raft leadership.
// This is ok since leadership follows the lease.
We may have discussed this before, but I can't remember -- is the idea here that we don't worry about not being the leaseholder because that situation is benign: since we won't evaluate proposals at this replica, we won't deduct any tokens, so the flow control machinery will go unused. And once we also become the leaseholder, it will start getting exercised? If yes, this could use a code comment, since "is ok since leadership follows the lease" doesn't give much confidence to a reader about correctness in the intermediate states when they are not colocated.
pkg/kv/kvserver/flow_control_replica_integration.go line 99 at r42 (raw file):
localRepl, found := f.lastKnownReplicas.GetReplicaDescriptorByID(f.replicaForFlowControl.getReplicaID())
if !found {
	log.Fatalf(ctx, "leader (replid=%d) didn't find self in last known replicas (%s)",
Is this assertion relying on the fact that replicaForFlowControl is locked, so the descriptor could not have changed state while this callback is ongoing?
pkg/kv/kvserver/flow_control_replica_integration.go line 170 at r42 (raw file):
// We're observing ourselves get removed from the raft group, but
// are still retaining raft leadership. Close the underlying handle
// and bail.
Is there a reason this is not an assertion failure like the previous case? I think we need a code comment explaining why some things are tolerated and some are not -- it's making me slightly nervous about this code model where the notification name tells us something (like onBecameLeader) and then we fetch some state inside that notification and expect it to be consistent with what we were told.
pkg/kv/kvserver/flow_control_replica_integration.go line 232 at r42 (raw file):
for _, repl := range f.lastKnownReplicas.Descriptors() {
	if repl.ReplicaID == ourReplicaID {
		continue
should we assert that ourReplicaID is not in disconnectedStores?
pkg/kv/kvserver/flow_control_replica_integration.go line 318 at r42 (raw file):
if _, found := pausedFollowers[repl.ReplicaID]; found { // As of 4/23, we don't make any strong guarantees around the set of
who is "we" in this statement? Is this saying pausedFollowers can include replicas that are not in the range descriptor, so we'll return such non-existent replicas from notActivelyReplicatingTo and call f.innerHandle.DisconnectStream(ctx, stream) on it?
A longer comment would be helpful.
irfansharif
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @pavelkalinnikov, @smg260, @sumeerbhola, and @tbg)
pkg/kv/kvserver/flow_control_replica_integration.go line 88 at r42 (raw file):
Previously, sumeerbhola wrote…
We may have discussed this before, but I can't remember -- is the idea here that we don't worry about not being leaseholder as that is a benign situation in that since we won't evaluate proposals at this replica we won't deduct any tokens, so the flow control stuff will go unused. And once we also become the leaseholder they will start getting exercised? If yes, this could use a code comment since "is ok since leadership follows the lease" doesn't give much confidence to a reader about correctness in the intermediate states when they are not colocated.
The referenced "I5" had more detail that I've copied over here. But yes, to everything you said.
// When leadership is lost we release all held flow tokens. Tokens are only
// deducted at proposal time when the proposing replica is both the raft
// leader and leaseholder (the latter is tautological since only
// leaseholders propose). We're relying on timely acquisition of raft
// leadership by the leaseholder to not be persistently over admitting.
pkg/kv/kvserver/flow_control_replica_integration.go line 99 at r42 (raw file):
Previously, sumeerbhola wrote…
Is this assertion relying on the fact that replicaForFlowControl is locked, so the descriptor could not have changed state while this callback is ongoing?
Yes. The other reason this assertion here exists is because the "local" stream is never disconnected, which you've commented on below. There's an equivalent assertion for the local stream never getting disconnected.
pkg/kv/kvserver/flow_control_replica_integration.go line 170 at r42 (raw file):
Previously, sumeerbhola wrote…
Is there a reason this is not an assertion failure like the previous case. I think we need a code comment explaining why some things are tolerated and some are not -- it's making me slightly nervous about this code model where the notification name tells us something (like onBecameLeader) and then we fetch some state inside that notification and expect it to be consistent with what we were told.
It's not an assertion because this is a valid code path we can hit -- it gets triggered in TestFlowControlRaftMembershipRemoveSelf (which I've referenced). The story here is simple, I think: the descriptor changed, and the change includes the handler being removed. It's not inconsistent with what we're being told (onDescChanged). The notification API makes no claim that onBecameFollower can only come after onDescChanged.
pkg/kv/kvserver/flow_control_replica_integration.go line 232 at r42 (raw file):
Previously, sumeerbhola wrote…
should we assert that ourReplicaID is not in disconnectedStores?
We're already asserting that in the following code snippet. It runs before the one place where disconnectedStores is written to:
// disconnectStreams disconnects replication streams for the given replicas.
func (f *replicaFlowControlIntegrationImpl) disconnectStreams(
	ctx context.Context, toDisconnect []roachpb.ReplicaDescriptor, reason string,
) {
	ourReplicaID := f.replicaForFlowControl.getReplicaID()
	for _, repl := range toDisconnect {
		if repl.ReplicaID == ourReplicaID {
			log.Fatal(ctx, "replica attempting to disconnect from itself")
		}
pkg/kv/kvserver/flow_control_replica_integration.go line 318 at r42 (raw file):
Previously, sumeerbhola wrote…
who is "we" in this statement? Is this saying pausedFollowers can include replicas that are not in the range descriptor, so we'll return such non-existent replicas from notActivelyReplicatingTo and call f.innerHandle.DisconnectStream(ctx, stream) on it?
A longer comment would be helpful.
Removed the "we" but meant it in the royal pronoun sense 😛
It's not possible to return non-existent replicas from this method through notActivelyReplicatingTo and invoke DisconnectStream on them, because the loop iterates through the descriptor's replicas. Expanded the comment:
// As of 6/23, there are no strong guarantees around the set of
// paused followers we're tracking, nothing that ensures that what's
// tracked is guaranteed to be a member of the range descriptor.
// This is why we treat the range descriptor derived state as
// authoritative (we're using it in the loop iteration and only
// tracking replicas here that are both paused AND part of the
// descriptor).
sumeerbhola
left a comment
Reviewed 1 of 32 files at r43.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @irfansharif, @pavelkalinnikov, @smg260, and @tbg)
pkg/kv/kvserver/flow_control_replica_integration.go line 105 at r43 (raw file):
// This assertion relies on replicaForFlowControl being locked, so the
// descriptor could not have changed state while this callback is
// ongoing. We also never disconnect the local stream until
incomplete sentence
Thanks for the reviews! 🍻🍻🍻 bors r+ single on p=420

Build failed:

Timed out.

bors r+ single on

The CI run itself had succeeded, but bors for some reason timed out. Asked internally.

Canceled.

Appended a commit to bump the bors timeout to 1h20m to see if it helps. bors r+ single on p=420

Build failed:
This commit integrates the kvflowcontrol.Dispatch with the
kvserver-level RaftTransport. When log entries are admitted below raft,
we'll want to inform the origin nodes of this fact, effectively
returning the flow tokens that were deducted when replicating the log
entry to us. We repurpose the existing RaftTransport for this
communication -- we piggyback these flow token returns[^1] on raft
messages already bound to nodes we're returning tokens to. We also
guarantee delivery of tokens in the presence of idle RaftTransport
connections[^2].
We had to introduce some protocol changes here. When a client
establishes a RaftMessageRequestBatch stream, it sends along to the
server the set of all StoreIDs it has. It's populated on the first
RaftMessageRequestBatch sent along MultiRaft.RaftMessageBatch gRPC
stream identifying at least one store, and then populated once more if
any additional stores have been initialized[^3]. This data is used by
the kvflowcontrol machinery to track the exact set of stores on the
client node. It uses this information to react to the gRPC streams
breaking. Since these streams are used to piggyback information about which
log entries were admitted below raft[^4] in order for the server-side to
free up flow tokens, if the stream breaks we possibly risk leaking these
tokens. So when these streams break, we use information about the
client's stores to release all held tokens[^5].
We're not using this code just yet; this commit is just the below-raft
integration with kvflowcontrol. The subsequent commit will introduce the
above-raft integration where we'll actually deduct flow tokens at the
sender, encode proposals using EntryEncoding{Standard,Sideloaded}WithAC,
which in turn enqueues virtual work items in below-raft admission queues
for asynchronous admission. Once asynchronously admitted, using the
changes in this commit, we'll return flow tokens using the now-wired-up
kvflowcontrol.Dispatch interface.
---
Suggested reading order for reviewers:
- (*RaftTransport).kvflowControl
Brief comment block which tries to give a lay of the land.
- flow_control_stores.go
Integration interface+implementation that's going to be used by the
RaftTransport to return flow tokens to the specific locally held
kvflowcontrol.Handles, after learning about admitted raft log entries
from remote nodes. It's implemented more fully in the subsequent commit.
- flow_control_raft_transport.go
Contains the set of new dependencies now used in the RaftTransport
code for flow token purposes. It also includes the interfaces that
show how the RaftTransport informs individual replicas that it's no
longer connected to specific (remote) stores. They're used more fully
in the subsequent commit.
- raft_transport.go
The actual code changes to the RaftTransport.
- flow_token_transport_test.go and flow_token_transport/*
Datadriven test to understand how the various pieces fit together.
- kvflowdispatch/*
Adds some metrics and unit testing for the canonical
kvflowcontrol.Dispatch implementation (previously implemented).
---
[^1]: In the form of kvflowcontrolpb.AdmittedRaftLogEntries.
[^2]: See kvserver.TestFlowTokenTransport.
[^3]: This two-step process is because of how and when we allocate
StoreIDs. Ignoring nodes that are bootstrapping the cluster (which
just picks the initial set of StoreIDs -- see
pkg/server.bootstrapCluster), whenever a new node is added, it's
assigned a node ID and store ID by an existing node in CRDB (see
kvpb.JoinNodeResponse). Subsequent store IDs, for multi-store
nodes, are generated by the joining node by incrementing a
sequence ID generator (see
pkg/server.(*Node).initializeAdditionalStores). All of which is to
say that the very first time we issue a RaftMessageRequestBatch,
we might not have all the StoreIDs. But we will very shortly
after, and certainly before any replicas get allocated to the
additional store.
[^4]: See kvflowcontrolpb.AdmittedRaftLogEntries and its use in
RaftMessageRequest.
[^5]: See I1 from kvflowcontrol/doc.go.
Release note: None
Part of cockroachdb#95563. This PR integrates various kvflowcontrol components into the critical path for replication traffic. It does so by introducing two "integration interfaces" in the kvserver package to intercept various points of a replica's lifecycle, using them to manage the underlying replication streams and flow tokens. The integration is mediated through two cluster settings:
- kvadmission.flow_control.enabled
  This is a top-level kill-switch to revert to pre-kvflowcontrol behavior where follower writes unilaterally deducted IO tokens without blocking.
- kvadmission.flow_control.mode
  It can take on one of two settings, each exercising the flow control machinery to varying degrees.
  - apply_to_elastic
    Only applies admission delays to elastic traffic.
  - apply_to_all
    Applies admission delays to {regular,elastic} traffic.
When the mode is changed, we simply admit all waiting requests. This risks possibly over-admitting work, but that's ok -- we assume these mode changes are rare events and done under supervision. These settings are hooked into in the kvadmission and kvflowcontroller packages. As for the actual integration interfaces in kvserver, they are:
- replicaFlowControlIntegration: used to integrate with replication flow control. It intercepts various points in a replica's lifecycle, like it acquiring raft leadership or losing it, or its raft membership changing, etc. Accessing it requires Replica.mu to be held, exclusively (this is asserted on in the canonical implementation).

  type replicaFlowControlIntegration interface {
    handle() (kvflowcontrol.Handle, bool)
    onBecameLeader(context.Context)
    onBecameFollower(context.Context)
    onDescChanged(context.Context)
    onFollowersPaused(context.Context)
    onReplicaDestroyed(context.Context)
    onProposalQuotaUpdated(context.Context)
  }

- replicaForFlowControl abstracts the interface of an individual Replica, as needed by replicaFlowControlIntegration.
  type replicaForFlowControl interface {
    assertLocked()
    annotateCtx(context.Context) context.Context
    getTenantID() roachpb.TenantID
    getReplicaID() roachpb.ReplicaID
    getRangeID() roachpb.RangeID
    getDescriptor() *roachpb.RangeDescriptor
    pausedFollowers() map[roachpb.ReplicaID]struct{}
    isFollowerActive(context.Context, roachpb.ReplicaID) bool
    appliedLogPosition() kvflowcontrolpb.RaftLogPosition
    withReplicaProgress(f func(roachpb.ReplicaID, rafttracker.Progress))
  }

Release note: None
This commit adds rudimentary /inspectz-style pages to CRDB. These hang
off of <url>/inspectz and support a few registered endpoints to inspect
kvflowcontrol state in JSON form.
- /inspectz/kvflowcontroller: marshals the state of registered flow
control streams, and how many {regular,elastic} tokens are available
for each one.
- /inspectz/kvflowhandles: marshals the state of all in-memory
kvflowcontrol.Handles held per leader+leaseholder replica, showing
all deducted tokens and corresponding log positions, and what
stream(s) each replica is connected to. It also supports querying
for specific ranges through a ?ranges=<int>[,<int>] query parameter.
To power these endpoints we introduced proto representations for various
components under pkg/../kvflowcontrol/kvflowinspectpb. Select
kvflowcontrol interfaces now expose an Inspect() method, returning the
relevant proto.
Other than just observing proto-state like the 90s, this commit also
adds indexed crdb_internal vtables to do more sophisticated filtering of
these inspectz-protos. It's easy to combine these tables to understand
exactly which ranges are blocked on flow tokens, and for which streams
in particular. They are:
CREATE TABLE crdb_internal.kv_flow_controller (
tenant_id INT NOT NULL,
store_id INT NOT NULL,
available_regular_tokens INT NOT NULL,
available_elastic_tokens INT NOT NULL
)
CREATE TABLE crdb_internal.kv_flow_control_handles (
range_id INT NOT NULL,
tenant_id INT NOT NULL,
store_id INT NOT NULL,
total_tracked_tokens INT NOT NULL,
INDEX(range_id)
)
CREATE TABLE crdb_internal.kv_flow_token_deductions (
range_id INT NOT NULL,
tenant_id INT NOT NULL,
store_id INT NOT NULL,
priority STRING NOT NULL,
log_term INT NOT NULL,
log_index INT NOT NULL,
tokens INT NOT NULL,
INDEX(range_id)
)
To see the set of ranges blocked on regular flow tokens, one can run
something like:
SELECT range_id,
crdb_internal.humanize_bytes(available_regular_tokens)
FROM crdb_internal.kv_flow_controller AS c
INNER JOIN crdb_internal.kv_flow_control_handles AS hs ON
c.tenant_id = hs.tenant_id AND c.store_id = hs.store_id
WHERE available_regular_tokens <= 0;
Or if looking to understand how many active replication streams each
leaseholder+leader range is shaping write traffic through, something
like:
SELECT range_id, count(*) AS streams
FROM crdb_internal.kv_flow_control_handles
GROUP BY (range_id)
ORDER BY streams DESC;
Release note: None
We add TestFlowControlIntegration, exercising the kvflowcontrol
integration interface (replicaFlowControlIntegration) introduced in an
earlier commit. It tests how the underlying kvflowcontrol.Handle is
constructed/destroyed, and its stream-management APIs invoked, as
replicas acquire raft leadership, lose it, observe paused and/or
inactive followers, change range descriptors, and react to raft progress
updates. We can write tests of the following form:
# Observe how the integration layer deals with paused followers.
# Start off with a triply replicated range r1/t1, with replicas on
# n1/s1, n2/s2, and n3/s3 (with replica IDs 1-3 respectively).
init tenant=t1 range=r1 replid=1
----
state descriptor=(1,2,3) applied=1/10
----
# Set up replid=1 (declared in init above) to be the raft leader. It
# should connect to all three replication streams.
integration op=became-leader
----
initialized flow control handle for r1/t1
connected to replication stream t1/s1 starting at log-position=1/10
connected to replication stream t1/s2 starting at log-position=1/10
connected to replication stream t1/s3 starting at log-position=1/10
# Pause replid=2. Observe that we disconnect the stream to t1/s2.
state descriptor=(1,2,3) paused=(2)
----
integration op=followers-paused
----
disconnected from replication stream t1/s2
These are still unit tests, testing the interface at just the
interface-level. We'll introduce more end-to-end integration testing of
an actual replica's lifecycle in subsequent commits.
Release note: None
Release note: None
TestFlowControl* are end-to-end tests of the kvflowcontrol
machinery, replicating + admitting individual writes. They make use of
an echotest-backed test harness to make it possible to observe
sophisticated KV interactions. We can now write tests that look like
this:
-- Flow token metrics, before issuing the regular 1MiB replicated
-- write.
SELECT name, crdb_internal.humanize_bytes(value::INT8)
FROM crdb_internal.node_metrics
WHERE name LIKE '%kvadmission%tokens%'
ORDER BY name ASC;
kvadmission.flow_controller.elastic_tokens_available | 0 B
kvadmission.flow_controller.elastic_tokens_deducted | 0 B
kvadmission.flow_controller.elastic_tokens_returned | 0 B
kvadmission.flow_controller.elastic_tokens_unaccounted | 0 B
kvadmission.flow_controller.regular_tokens_available | 0 B
kvadmission.flow_controller.regular_tokens_deducted | 0 B
kvadmission.flow_controller.regular_tokens_returned | 0 B
kvadmission.flow_controller.regular_tokens_unaccounted | 0 B
-- Flow token metrics from n1 after issuing the regular 1MiB
-- replicated write, and it being admitted on n1, n2 and n3. We
-- should see 3*1MiB = 3MiB of {regular,elastic} tokens deducted and
-- returned, and {8*3=24MiB,16*3=48MiB} of {regular,elastic} tokens
-- available. Everything should be accounted for.
SELECT name, crdb_internal.humanize_bytes(value::INT8)
FROM crdb_internal.node_metrics
WHERE name LIKE '%kvadmission%tokens%'
ORDER BY name ASC;
kvadmission.flow_controller.elastic_tokens_available | 24 MiB
kvadmission.flow_controller.elastic_tokens_deducted | 3.0 MiB
kvadmission.flow_controller.elastic_tokens_returned | 3.0 MiB
kvadmission.flow_controller.elastic_tokens_unaccounted | 0 B
kvadmission.flow_controller.regular_tokens_available | 48 MiB
kvadmission.flow_controller.regular_tokens_deducted | 3.0 MiB
kvadmission.flow_controller.regular_tokens_returned | 3.0 MiB
kvadmission.flow_controller.regular_tokens_unaccounted | 0 B
----
----
Release note: None
This commit documents the kvflowcontrol integration interfaces
introduced in earlier commits across flow_control_*.go, grouping
commentary and interfaces in a top-level flow_control_integration.go,
and makes minor simplifications where applicable. It's helpful to read
kvflowcontrol/{doc,kvflowcontrol}.go to understand the library
components in question, and also the comment block on
replicaFlowControlIntegration.
Here's how the various pieces fit together:
┌───────────────────┐
│ Receiver (client) │
├───────────────────┴─────────────────────┬─┬─┐
┌──○ kvflowcontrolpb.AdmittedRaftLogEntries │ │ │
│ └─────────────────────────────────────────┴─┴─┘
│ ┌───────────────────┐
│ │ Receiver (client) │
│ ├───────────────────┴─────────────────────┬─┬─┐
┌─────────────────────▶─┼──○ kvflowcontrolpb.AdmittedRaftLogEntries │ │ │
│ │ └─────────────────────────────────────────┴─┴─┘
['1] gRPC streams │
connecting/disconnecting [1] RaftMessageBatch
│ │
│ ┌─────────────────┐ │
│ │ Sender (server) │ │
│ ├─────────────────┴──│────────────────┐ ┌────────────────────────────────────────┐
│ │ RaftTransport │ │ │ StoresForFlowControl │
│ │ │ │ │ │
│ │ │ │ │ ┌───────────────────────────────────┐ │
│ │ └────[2] Lookup ─┼────────┼─▶│ kvflowcontrol.Handles ○─┼──┐
│ │ │ │ └───────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────┐ │ │ ┌───────────────────────────────────┐ │ │
└──┼▶│ connectionTrackerForFlowControl │ ├──['2]──┼─▶│ RaftTransportDisconnectedListener │ │ │
│ └─────────────────────────────────┘ │ │ └──────○────────────────────────────┘ │ │
└─────────────────▲───────────────────┘ └─────────┼──────────────────────────────┘ │
│ │ │
│ ['3] onRaftTransportDisconnected [3] ReturnTokensUpto
│ │ │
│ │ │
│ ┌──────────────────────────────┼─────────────────────────────────┼─────────┬─┬─┐
│ │ replicaFlowControlIntegration│ ┌──────────────────────▼───────┐ │ │ │
│ │ │ │ kvflowcontrol.Handle │ │ │ │
│ │ onBecameLeader() ▼ └───────────────────▲─▲────────┘ │ │ │
│ │ onBecameFollower() ○────['4] DisconnectStream ──┘ │ │ │ │
│ │ onDescChanged() ◀─── ["5] tryReconnect ──────┐ │ │ │ │
│ │ onFollowersPaused() ○─── ["7] ConnectStream ────┼─┘ │ │ │
│ │ = onRaftTransportDisconnected() ┌───────────────────▼──────────┐ │ │ │
│ │ = onRaftTicked() │ replicaForFlowControl │ │ │ │
│ │ onReplicaDestroyed() │ │ │ │ │
│ │ │ getDescriptor() │ │ │ │
["6] isConnectedTo │ │ getPausedFollowers() │ │ │ │
│ │ │ getBehindFollowers() │ │ │ │
│ │ │ getInactiveFollowers() │ │ │ │
└───────┼─────────────────────────────────────────▶ = getDisconnectedFollowers() │ │ │ │
│ └──────────────────────────────┘ │ │ │
└──────────────────────────────────────────────────────────────────────────┴─┴─┘
The "server" and "client" demarcations refer to the server and client-side of
the RaftTransport stream. "Sender" and "Receiver" is kvflowcontrol verbiage,
referring to where proposals originate (and flow tokens deducted) and the
remote follower nodes where they're received. Below-raft admission happens
asynchronously on the receiver nodes, of which the sender is informed, which
in turn lets it release flow tokens and unblock further proposals.
Notation:
- Stacked boxes (with " │││" on the right hand side) indicate that there are
multiple of a kind. Like multiple replicaFlowControlIntegration
implementations (one per locally held replica), multiple
kvflowcontrolpb.AdmittedRaftLogEntries, etc.
- [<digit>], [<digit>'], and [<digit>"] denote independent sequences,
explained in text below.
---
A. How are flow tokens returned after work is admitted below-raft on remote,
receiver nodes?
[1]: When work gets admitted below-raft on the receiver, the sender (where
work originated, and flow tokens were deducted) is informed of the fact
through the RaftMessageBatch gRPC stream. There are two bi-directional
raft transport streams between a pair of nodes. We piggyback
kvflowcontrolpb.AdmittedRaftLogEntries on raft messages being sent from
the RaftMessageBatch client to the RaftMessageBatch server.
[2]: We lookup the relevant kvflowcontrol.Handle from the set of
kvflowcontrol.Handles, to inform it of below-raft admission.
[3]: We use the relevant kvflowcontrol.Handle (hanging off of some locally
held replica) to return relevant previously deducted flow tokens.
The piggy-backing from [1] and the intercepting of piggy-backed messages and
kvflowcontrol.Handle lookup from [2] both happen in the RaftTransport layer,
in raft_transport.go. The set of local kvflowcontrol.Handles is exposed
through the StoresForFlowControl interface, backed by local stores and their
contained replicas. Each replica exposes the underlying handle through the
replicaFlowControlIntegration interface.
---
B. How do we react to raft transport streams breaking? (I1 from
kvflowcontrol/doc.go)
['1]: The server-side of RaftMessageBatch observes every client-initiated
stream breaking. The connectionTrackerForFlowControl, used within the
RaftTransport layer, also monitors all live gRPC streams to understand
exactly the set of clients we're connected to.
['2]: Whenever any raft transport gRPC stream breaks, we notify components of
this fact through the RaftTransportDisconnectedListener interface.
['3]: This in turn informs all locally held replicas, through the
replicaFlowControlIntegration interface.
['4]: We actively disconnect streams for replicas we just disconnected from
as informed by the raft transport.
Note that we actually plumb down information about exactly which raft
transport streams broke. It's not enough to simply inform the various
replicaFlowControlIntegrations of some transport stream breaking, and for
them to then determine which streams to disconnect. This is because it's
possible for the streams to be re-established in the interim, or for there to
be another active stream from the same client but using a different RPC
class. We still want to free up all tokens for that replication stream, lest
we leak flow tokens in transit on the particular stream that broke.
---
C. What happens when the raft transport streams reconnect? (I1 from
kvflowcontrol/doc.go)
["5]: The replicaFlowControlIntegration interface is used to periodically
reconnect previously disconnected streams. This is driven primarily
through the onRaftTicked() API, but also happens opportunistically
through onFollowersPaused(), onRaftTransportDisconnected(), etc.
["6]: We check whether we're connected to remote replicas via the
raftTransportForFlowControl.isConnectedTo(). This is powered by the
connectionTrackerForFlowControl embedded in the RaftTransport which
monitors all active gRPC streams as seen on the server-side.
["7]: If we're now connected to previously disconnected replicas, we inform
the underlying kvflowcontrol.Handle in order to deduct flow tokens for
subsequent proposals.
---
replicaFlowControlIntegration is used to integrate with replication flow
control. It intercepts various points in a replica's lifecycle, like it
acquiring raft leadership or losing it, or its raft membership changing, etc.
Accessing it requires Replica.mu to be held, exclusively (this is asserted on
in the canonical implementation). The "external" state is mediated by the
replicaForFlowControl interface. The state transitions look as follows:
─ ─ ─ ─ ─ ─ ─ ┌───── onDestroyed ──────────────────▶ ╳╳╳╳╳╳╳╳╳╳╳╳╳
─ ─ ─ ─ ─ ─ ┐ │ │ ┌─── onDescChanged(removed=self) ──▶ ╳ destroyed ╳
┌──────── onBecameLeader ─────────┐ │ │ ╳╳╳╳╳╳╳╳╳╳╳╳╳
│ │ │ │ │ │
○ ○ ○ ▼ ○ ○
┌ ─ ─ ─ ─ ─ ─ ─ ┐ ┌──────────────┐
─ ─ ─ ○ follower │ leader │ ○─────────────────────────────┐
└ ─ ─ ─ ─ ─ ─ ─ ┘ └──────────────┘ │
▲ ▲ ○ ▲ onDescChanged │
│ │ │ │ onFollowersPaused │
─ ─ ─ ─ ─ ─ ─ └──────── onBecameFollower ───────┘ └────── onRaftTransportDisconnected ─┘
onRaftTicked
We're primarily interested in transitions to/from the leader state -- the
equivalent transitions from the follower state are no-ops.
- onBecameLeader is when the replica acquires raft leadership. At this
point we initialize the underlying kvflowcontrol.Handle and other
internal tracking state to handle subsequent transitions.
- onBecameFollower is when the replica loses raft leadership. We close the
underlying kvflowcontrol.Handle and clear other tracking state.
- onDescChanged is when the range descriptor changes. We react to changes
by disconnecting streams for replicas no longer part of the range,
connecting streams for new members of the range, closing the underlying
kvflowcontrol.Handle + clearing tracking state if we ourselves are no
longer part of the range.
- onFollowersPaused is when the set of paused followers have changed. We
react to it by disconnecting streams for newly paused followers, or
reconnecting to newly unpaused ones.
- onRaftTransportDisconnected is when we're no longer connected to some
replicas via the raft transport. We react to it by disconnecting relevant
streams.
- onRaftTicked is invoked periodically, and refreshes the set of streams
we're connected to. It disconnects streams to inactive followers and/or
reconnects to now-active followers. It also observes raft progress state
for individual replicas, disconnecting from ones we're not actively
replicating to (because they're too far behind on their raft log, in need
of snapshots, or because we're unaware of their committed log indexes).
It also reconnects streams if the raft progress changes.
- onDestroyed is when the replica is destroyed. Like onBecameFollower, we
close the underlying kvflowcontrol.Handle and clear other tracking state.
Release note: None
TestFlowControlAdmissionPostSplitMerge walks through what happens when below-raft admission occurs after a range undergoes splits/merges. It does this by blocking and later unblocking below-raft admission, verifying that:
- tokens for the RHS are released at the post-merge leaseholder,
- admission for the RHS post-merge does not cause a double return of tokens,
- admission for the LHS can happen post-merge,
- admission for the LHS and RHS can happen post-split.
Release note: None
Disable kvadmission.flow_control.enabled by default. We'll re-enable it on master shortly after some baking time while it's switched off. We need to ensure that there are zero performance regressions when switched off, and that the integration code does not exercise new machinery when turned off. Merging this as turned-off-by-default also reduces revert likelihood. Release note: None
We've seen the per-PR stress jobs take upwards of 55m and be successful. This happens for PRs with a wide surface area. Release note: None
Hit #104639 again. Appending a commit to skip it. bors r+ single on p=99

Build failed:

Any day now. bors r+ single on p=99

Build failed:
It's flaky. See cockroachdb#104639.
Release note: None
We see this (benign) data race under --stress --race. Squash it by
targeted use of atomics.
Read at 0x00c0039c06a8 by goroutine 31161:
github.com/cockroachdb/cockroach/pkg/util/admission.(*StoreWorkQueue).gcSequencers()
github.com/cockroachdb/cockroach/pkg/util/admission/work_queue.go:2176 +0x196
github.com/cockroachdb/cockroach/pkg/util/admission.makeStoreWorkQueue.func1()
github.com/cockroachdb/cockroach/pkg/util/admission/work_queue.go:2162 +0x5b
Previous write at 0x00c0039c06a8 by goroutine 31105:
github.com/cockroachdb/cockroach/pkg/util/admission.(*sequencer).sequence()
github.com/cockroachdb/cockroach/pkg/util/admission/sequencer.go:61 +0x190
github.com/cockroachdb/cockroach/pkg/util/admission.(*StoreWorkQueue).sequenceReplicatedWork()
github.com/cockroachdb/cockroach/pkg/util/admission/work_queue.go:2193 +0x145
github.com/cockroachdb/cockroach/pkg/util/admission.(*StoreWorkQueue).Admit()
Release note: None
The other subtest in #104639 flakes, skipping the whole thing. bors r+ single on p=21

Build succeeded:
Part of #95563. See individual commits.
Release note: None