storage: issue swaps on AllocatorConsiderRebalance#40284
craig[bot] merged 7 commits into cockroachdb:master
Conversation
nvb
left a comment
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 4 of 4 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/allocator.go, line 402 at r2 (raw file):
// The dead replica(s) will be down-replicated later.
priority := addDeadReplacementPriority
log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d dead replicas priority=%.2f",
AllocatorReplaceDead
pkg/storage/allocator.go, line 410 at r2 (raw file):
// Range has decommissioning replica(s), which should be replaced.
priority := addDecommissioningReplacementPriority
log.VEventf(ctx, 3, "AllocatorAdd - replacement for %d decommissioning replicas priority=%.2f",
AllocatorReplaceDecommissioning
pkg/storage/allocator_test.go, line 5023 at r2 (raw file):
action, _ := a.ComputeAction(ctx, zone, &desc)
require.Equal(t, c.expectedAction.String(), action.String())
Why do we need the .String()?
pkg/storage/replicate_queue.go, line 344 at r2 (raw file):
existingReplicas := liveVoterReplicas
// WIP(tbg): pass a slice of replicas that can be replaced in.
// In ReplaceDead, the it's the dead replicas, in ReplaceDecommissioning
s/the it's/it's/
pkg/storage/replicate_queue.go, line 423 at r5 (raw file):
ctx, repl, []roachpb.ReplicationChange{{Target: newReplica, ChangeType: roachpb.ADD_REPLICA}},
Can we not use roachpb.MakeReplicationChanges here and below?
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
not taking into account the replica we're going to remove
I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?
pkg/storage/replicate_queue_test.go, line 140 at r6 (raw file):
// Query the range log to see if anything unexpected happened. Concretely,
// we'll make sure that our tracked ranges never had >=3 replicas.
> 3 replicas
tbg
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/allocator_test.go, line 5023 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why do we need the .String()?
It was printed as %d for some reason.
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
not taking into account the replica we're going to remove
I must be missing how we're ensuring this. findRemoveTarget doesn't mutate existingReplicas, does it?
We're removing one from existingReplicas and are adding one that is not in existingReplicas, hence they're not the same, so I think the comment is true-- but looking into this I saw that I'm reinventing the wheel:
cockroach/pkg/storage/allocator.go
Lines 708 to 715 in 8d8e8ff
I should return both the removal and addition target from `RebalanceTarget`.
Force-pushed from b6d5c9c to 6315bd3
nvb
left a comment
Reviewed 5 of 5 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 1 of 1 files at r10, 4 of 4 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
We're removing one from existingReplicas and are adding one that is not in existingReplicas
Yeah, but where are we doing that?
Oh, unless the point is that existingReplicas still contains removeReplica when passed to RebalanceTarget, in which case the "not taking into account the replica we're going to remove" is easy to misinterpret as the opposite of what it's intending to say.
I should return both the removal and addition target from `RebalanceTarget`.
You're doing that now?
tbg
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
You're doing that now?
Yep, done
nvb
left a comment
Reviewed 1 of 3 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/allocator_test.go, line 609 at r12 (raw file):
{
	var rangeUsageInfo RangeUsageInfo
	target, details, _, ok := a.RebalanceTarget(
Is details in the wrong position?
pkg/storage/allocator_test.go, line 855 at r12 (raw file):
sg.GossipStores(stores, t)
for i := 0; i < 10; i++ {
	target, details, _, ok := a.RebalanceTarget(
Same question.
pkg/storage/allocator_test.go, line 873 at r12 (raw file):
sg.GossipStores(stores, t)
for i := 0; i < 10; i++ {
	target, details, _, ok := a.RebalanceTarget(
Mind checking all of these?
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You're doing that now?
Yep, done
We don't need to transfer the lease away from removeTarget?
pkg/storage/replicate_queue.go, line 697 at r12 (raw file):
// rebalance. Attempt to find a rebalancing target.
if !rq.store.TestingKnobs().DisableReplicaRebalancing {
	// Then, not taking into account the replica we're going to remove, see
This comment is strange now, since this is the only step.
Force-pushed from 36bbb63 to 8c66ec7
tbg
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replicate_queue.go, line 717 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We don't need to transfer the lease away from removeTarget?
Good catch, we do. Added.
nvb
left a comment
Reviewed 2 of 2 files at r13.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replicate_queue.go, line 705 at r13 (raw file):
} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
	log.VEventf(ctx, 1, "want to remove self, but failed to transfer lease away: %s", err)
} else if !done {
nit: I'd do this like:
if !ok {
...
} else if done, err := rq.maybeTransferLeaseAway(ctx, repl, removeTarget.StoreID, dryRun); err != nil {
...
} else if done {
// Lease is now elsewhere, so we're not in charge any more.
return false, nil
} else {
...
}
This gives you a place to add a comment.
Force-pushed from 8c66ec7 to 1121c4c
Down from 10s to the usual 3-4s for this type of test. Release note: None
Teach the allocator that there is a concept of "replacing" dead and decommissioning replicas. Add a stub handler for them to the replicate queue that does exactly what AllocatorAdd does, i.e., maintain the previous behavior. A follow-up commit will then teach the replicate queue to actually issue replica replacement operations both for these two new options as well as in AllocatorConsiderRebalance. When that is established, we should be in a place where the replicate queue never puts ranges into fragile intermediate configurations except that it will still transition through even configurations when upreplicating (which can be tackled separately). Release note: None
We'll need this to find replacements when rebalancing. Release note: None
This code repeats too much, and we'll need it in more places soon. Release note: None
Release note: None
Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled. Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note)
RebalanceTarget was internally already picking out a replica that could be removed. This is now returned to the caller and actually removed during rebalancing. Release note: None
Force-pushed from 1121c4c to f94a83e
tbg
left a comment
TFTR!
bors r=nvanbenschoten
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
40208: distsql: add disk spilling to lookup joiner r=solongordon a=solongordon

In lookup joins on partial index keys, there is no limit on how many rows might be returned by any particular lookup, so the joinreader may be buffering an unbounded number of rows into memory. I changed joinreader to use a disk-backed row container rather than just storing the rows in memory with no accounting. Fixes #39044

Release note (bug fix): Lookup joins now spill to disk if the index lookups return more rows than can be stored in memory.

40284: storage: issue swaps on AllocatorConsiderRebalance r=nvanbenschoten a=tbg

Change the rebalancing code so that it not only looks up a new replica to add, but also picks one to remove. Both actions are then given to a ChangeReplicas invocation which will carry it out atomically as long as that feature is enabled.

Release note (bug fix): Replicas can now be moved between stores without entering an intermediate configuration that violates the zone constraints. Violations may still occur during zone config changes, decommissioning, and in the presence of dead nodes (NB: the remainder will be addressed in a future change, so merge the corresponding release note)

40300: store: pull updateMVCCGauges out of StoreMetrics lock, use atomics r=nvanbenschoten a=nvanbenschoten

The operations it performs are already atomic, so we can use atomic add instructions to avoid any critical section. This was responsible for 8.15% of mutex contention on a YCSB run. The change also removes MVCCStats from the `storeMetrics` interface, which addresses a long-standing TODO.

40301: roachtest: Deflake clock jump test r=tbg a=bdarnell

These tests perform various clock jumps, then reverse them. The reverse can cause a crash even if the original jump did not. Add some sleeps to make things more deterministic and improve the recovery process at the end of the test. Fixes #38723

Release note: None

40305: exec: modify tests to catch bad selection vector access r=rafiss a=rafiss

The runTests helper will now cause a panic if a vectorized operator tries to access a part of the selection vector that is out of bounds. This identified bugs in the projection operator.

Release note: None

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Build succeeded
As of cockroachdb#40284, the replicate queue was issuing swaps (atomic add+remove) during rebalancing. TestInitialPartitioning helpfully points out (once you flip atomic rebalancing on) that when the replication factor is one, there is no way to perform such an atomic swap because it will necessarily have to remove the leaseholder.

To work around this restriction (which, by the way, we dislike - see cockroachdb#40333), fall back to just adding a replica in this case without also removing one. In the next scanner cycle (which should happen immediately since we requeue the range) the range will be over-replicated and hopefully the lease will be transferred over and then the original leaseholder removed. I would be very doubtful that this all works, but it is how things worked until cockroachdb#40284, so this PR really just falls back to the previous behavior in cases where we can't do better.

Release note: None
40363: storage: work around can't-swap-leaseholder r=nvanbenschoten a=tbg

As of #40284, the replicate queue was issuing swaps (atomic add+remove) during rebalancing. TestInitialPartitioning helpfully points out (once you flip atomic rebalancing on) that when the replication factor is one, there is no way to perform such an atomic swap because it will necessarily have to remove the leaseholder.

To work around this restriction (which, by the way, we dislike - see #40333), fall back to just adding a replica in this case without also removing one. In the next scanner cycle (which should happen immediately since we requeue the range) the range will be over-replicated and hopefully the lease will be transferred over and then the original leaseholder removed. I would be very doubtful that this all works, but it is how things worked until #40284, so this PR really just falls back to the previous behavior in cases where we can't do better.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Change the rebalancing code so that it not only looks up a new replica to
add, but also picks one to remove. Both actions are then given to a
ChangeReplicas invocation which will carry it out atomically as long as
that feature is enabled.
Release note (bug fix): Replicas can now be moved between stores without
entering an intermediate configuration that violates the zone constraints.
Violations may still occur during zone config changes, decommissioning, and
in the presence of dead nodes (NB: the remainder will be addressed in a future
change, so merge the corresponding release note)