kvserver: fix constraint analysis on replica replacement#94810
craig[bot] merged 2 commits into cockroachdb:master
Conversation
Force-pushed 2dff994 to d24c606
Force-pushed 4f8e335 to 80d6a10
kvoli
left a comment
A few comments/questions. The linter is making noise in a few areas.
Reviewed 4 of 4 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)
-- commits line 22 at r2:
nit: beingn
pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 1326 at r2 (raw file):
`conf roachpb.SpanConfig, existingVoters, existingNonVoters []roachpb.ReplicaDescriptor, replacing *roachpb.ReplicaDescriptor,`
I'm not a huge fan of passing in a reference that could be mutated. I understand we discussed before that a list wasn't great either since it implied you could be replacing more than 1 replica at a time. Is there an alternative?
pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go line 1907 at r2 (raw file):
```go
// If so, then only replacing it with a satisfying store is valid to ensure
// that the constraint stays fully satisfied.
if necessary && satisfiedByExistingStore && !satisfiedByCandidateStore {
```
Nice stuff here - this is clear to understand.
pkg/kv/kvserver/allocator_impl_test.go line 112 at r1 (raw file):
```go
Locality: roachpb.Locality{
	Tiers: []roachpb.Tier{
		{
```
It looks like these are added in the 1st commit then edited in the 2nd commit to remove some flags. Is this necessary?
Force-pushed 80d6a10 to b856e5d
AlexTalks
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 1326 at r2 (raw file):
Previously, kvoli (Austen) wrote…
I'm not a huge fan of passing in a reference that could be mutated. I understand we discussed before that a list wasn't great either since it implied you could be replacing more than 1 replica at a time. Is there an alternative?
I can pass in a concrete roachpb.ReplicaDescriptor and use Validate (or some new IsValid function) to check if it has a value of course. I prefer using nil to indicate that we don't have a replica being replaced, but I understand the concern here. Let me know if you definitely think this should be changed.
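For illustration, the two alternatives under discussion could be sketched like this. This is a hedged sketch with a simplified, hypothetical type; the real `roachpb.ReplicaDescriptor` has more fields and these are not the actual allocator signatures:

```go
package main

import "fmt"

// ReplicaDescriptor is a stand-in for roachpb.ReplicaDescriptor
// (hypothetical simplification for this sketch).
type ReplicaDescriptor struct {
	NodeID    int32
	StoreID   int32
	ReplicaID int32
}

// IsValid reports whether the descriptor holds a real replica; a zero
// value would mean "no replica being replaced" under the value-based
// alternative.
func (r ReplicaDescriptor) IsValid() bool {
	return r.ReplicaID != 0
}

// allocateTargetPtr uses the nil-pointer convention: nil means no
// replica is being replaced.
func allocateTargetPtr(replacing *ReplicaDescriptor) string {
	if replacing == nil {
		return "plain allocation"
	}
	return fmt.Sprintf("replacement for replica %d", replacing.ReplicaID)
}

// allocateTargetValue uses a value plus IsValid instead, avoiding a
// mutable reference at the cost of a sentinel zero value.
func allocateTargetValue(replacing ReplicaDescriptor) string {
	if !replacing.IsValid() {
		return "plain allocation"
	}
	return fmt.Sprintf("replacement for replica %d", replacing.ReplicaID)
}

func main() {
	fmt.Println(allocateTargetPtr(nil))
	fmt.Println(allocateTargetValue(ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 7}))
}
```

The trade-off: the pointer makes "absent" explicit in the type but exposes a mutable reference; the value form is safer but relies on the zero value as a sentinel.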
pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go line 1907 at r2 (raw file):
Previously, kvoli (Austen) wrote…
Nice stuff here - this is clear to understand.
Thanks - I was worried we might need additional cases here but hopefully this makes sense. This is also the only case where we can return valid = false, necessary = true, but that is primarily done so we can log it in the ranking process.
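As a sketch, the decision described here behaves roughly as follows. The function name and the second return value's fallback logic are illustrative assumptions; only the rejection condition itself comes from the diff quoted above:

```go
package main

import "fmt"

// checkReplacement sketches the constraint check discussed above. If a
// constraint is necessary, is satisfied by the store being replaced,
// and is not satisfied by the candidate, the candidate is invalid.
// This is the one case that can return valid == false together with
// necessary == true, surfaced so the ranking process can log it.
func checkReplacement(necessary, satisfiedByExistingStore, satisfiedByCandidateStore bool) (valid, nec bool) {
	if necessary && satisfiedByExistingStore && !satisfiedByCandidateStore {
		return false, true
	}
	// Otherwise the candidate is valid; it is "necessary" only if it
	// itself helps satisfy a necessary constraint (assumed fallback).
	return true, necessary && satisfiedByCandidateStore
}

func main() {
	valid, nec := checkReplacement(true, true, false)
	fmt.Println(valid, nec) // rejected, but flagged necessary for logging
}
```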
pkg/kv/kvserver/allocator_impl_test.go line 112 at r1 (raw file):
Previously, kvoli (Austen) wrote…
It looks like these are added in the 1st commit then edited in the 2nd commit to remove some flags. Is this necessary?
Not necessary, should be fixed now.
kvoli
left a comment
Reviewed 8 of 8 files at r3, 8 of 8 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)
pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 1326 at r2 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
I can pass in a concrete `roachpb.ReplicaDescriptor` and use `Validate` (or some new `IsValid` function) to check if it has a value of course. I prefer using `nil` to indicate that we don't have a replica being replaced, but I understand the concern here. Let me know if you definitely think this should be changed.
It doesn't seem worth changing over it.
pkg/kv/kvserver/allocator_impl_test.go line 112 at r1 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Not necessary, should be fixed now.
Did this update push? I can still see them being removed in c2.
b856e5d#diff-a380886513365b4c5d0612c9bb1f2ee62a11571d189141fcb56d4c41182f4dbfL112
pkg/kv/kvserver/store_test.go line 3595 at r4 (raw file):
```go
	{NodeID: 5, StoreID: 5, ReplicaID: 5},
},
zoneConfig: zonepb.DefaultSystemZoneConfigRef(),
```
s/zoneConfig/spanConfig/ here and a few other spots.
Force-pushed b856e5d to 32eec11
AlexTalks
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 1326 at r2 (raw file):
Previously, kvoli (Austen) wrote…
It doesn't seem worth changing over it.
Done.
pkg/kv/kvserver/allocator_impl_test.go line 112 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Did this update push? I can still see them being removed in c2.
b856e5d#diff-a380886513365b4c5d0612c9bb1f2ee62a11571d189141fcb56d4c41182f4dbfL112
Fixed!
pkg/kv/kvserver/store_test.go line 3595 at r4 (raw file):
Previously, kvoli (Austen) wrote…
s/zoneConfig/spanConfig/ here and a few other spots.
Bad rebase, sorry!
This change adds tests using `store.AllocatorCheckRange(..)` to
investigate the behavior of attempting to decommission a node when doing
so would cause constraints to be broken. This test is needed because it
was uncovered recently that a partially constrained range (i.e. a range
configured with `num_replicas = 3, constraint = '{<some constraint>:
1}'`) may not have the constraint fully respected if the only node that
satisfies said constraint is being decommissioned.
Part of cockroachdb#94809
Depends on cockroachdb#94024.
Release note: None
This changes the allocator to be fully aware of the replica that is being replaced when allocating new target replicas due to a dead or decommissioning node. This ensures that if constraints were respected before the dead or decommissioning node left the cluster, they will still be respected afterwards, particularly in the case at issue, which is when partial constraints are set on a range (e.g. `num_replicas = 3, <some_constraint>: 1`). This is achieved by rejecting candidate stores to allocate the replacement on when they do not satisfy a constraint that was satisfied by the existing store. In doing so, this means that some node decommissions that would previously be able to execute will no longer be allowed, since they would violate the configured constraints.

Fixes cockroachdb#94809.

Release note (bug fix): Decommissions that would violate constraints set on a subset of replicas for a range (e.g. `num_replicas = 3, <constraint>: 1`) will no longer be able to execute, respecting constraints during and after the decommission.
Force-pushed 32eec11 to f7db7e3
AlexTalks
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/allocator/allocatorimpl/allocator.go line 1216 at r8 (raw file):
```go
// should be considered of otherwise equal validity, with candidate ranking
// choosing the best of the available options.
var decommissioningReplica *roachpb.ReplicaDescriptor
```
FYI @kvoli I added this after some thought and reviewing a failing multiregion test. I think it makes sense, but if you don't like this logic being in the allocator (and would rather have it in the replicate queue), let me know. Given that I think our goal is to consolidate more logic into the allocator, it made sense to me to live here.
(The issue is, briefly, if we have a multi-region cluster configured to survive region failure, and all nodes in a region die, we want to ensure that replacement of their replicas proceeds despite their replacement effectively breaking constraints).
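The behavior described here could be sketched as follows. All names are hypothetical and this is a simplification of the allocator logic, not the actual code:

```go
package main

import "fmt"

// candidateAcceptable sketches the relaxation described above: when the
// replica being replaced lives on a dead or decommissioning store (e.g.
// an entire region went down in a region-survivable cluster), candidates
// that break a previously-satisfied constraint are treated as otherwise
// equally valid, so upreplication can proceed and candidate ranking
// picks the best remaining option.
func candidateAcceptable(candidateSatisfies, existingSatisfies, existingStoreDecommissioning bool) bool {
	if existingStoreDecommissioning {
		// Never block replacement of a replica on a dead or
		// decommissioning store; restoring the replication factor wins
		// over strict constraint conformance.
		return true
	}
	// Otherwise reject candidates that would break a constraint the
	// existing store satisfied.
	return candidateSatisfies || !existingSatisfies
}

func main() {
	fmt.Println(candidateAcceptable(false, true, true))  // relaxed: accepted
	fmt.Println(candidateAcceptable(false, true, false)) // strict: rejected
}
```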
bors r+

Build succeeded.
This changes the allocator to be fully aware of the replica that is
being replaced when allocating new target replicas due to a dead or
decommissioning node. This ensures that if constraints were respected
before the dead or decommissioning node left the cluster, they will
still be respected afterwards, particularly in the case at issue, which
is when partial constraints are set on a range (e.g.
`num_replicas = 3, <some_constraint>: 1`). This is achieved by rejecting
candidate stores to allocate the replacement on when they do not satisfy
a constraint that was satisfied by the existing store. In doing so, this
means that some node decommissions that would previously be able to
execute will no longer be allowed, since they would violate the
configured constraints.
Fixes #94809.
Depends on #94024.
Release note (bug fix): Decommissions that would violate constraints set
on a subset of replicas for a range (e.g.
`num_replicas = 3, <constraint>: 1`) will no longer be able to execute,
respecting constraints during and after the decommission.