server: evaluate decommission pre-checks #93758
Conversation
Force-pushed from ff2f482 to c9f5e35
Force-pushed from c9f5e35 to 6ddfcb7
Force-pushed from 6ddfcb7 to 8c4d3bb
This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on cockroachdb#93758, cockroachdb#90222. Epic: CRDB-20924 Release note: None
Force-pushed from 8c4d3bb to 3dfe4dd
Force-pushed from 3dfe4dd to e98f254
Force-pushed from e98f254 to 74b47e7
Force-pushed from 74b47e7 to 9c05382
Force-pushed from 65641db to 1e57036
While we are correctly using the overrides set in the `OverrideStorePool` for the purposes of the node liveness function, the node count function did not properly incorporate the overrides previously. This change rectifies that, using the preset overrides specified at creation of the override store pool to calculate the number of non-decommissioning, non-decommissioned nodes (alive or dead), as viewed by the override store pool. This allows for correct calculation of the number of needed voters, allowing us to correctly determine which allocation action is needed for a range. Depends on cockroachdb#93758. Epic: CRDB-20924 Release note: None
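The fix described above can be sketched as follows: the count consults the preset override map first, then falls back to the stored liveness status, and excludes decommissioning/decommissioned nodes whether alive or dead. Names are illustrative, not the real `OverrideStorePool` API.

```go
package main

import "fmt"

type LivenessStatus int

const (
	StatusLive LivenessStatus = iota
	StatusDead
	StatusDecommissioning
	StatusDecommissioned
)

// nonDecommissioningNodeCount counts nodes that are neither
// decommissioning nor decommissioned, applying per-node overrides
// before falling back to the stored status.
func nonDecommissioningNodeCount(statuses, overrides map[int]LivenessStatus) int {
	n := 0
	for id, s := range statuses {
		if o, ok := overrides[id]; ok {
			s = o // preset override takes precedence
		}
		if s == StatusDecommissioning || s == StatusDecommissioned {
			continue
		}
		n++ // alive or dead both count toward needed voters
	}
	return n
}

func main() {
	statuses := map[int]LivenessStatus{1: StatusLive, 2: StatusDead, 3: StatusLive}
	overrides := map[int]LivenessStatus{3: StatusDecommissioning}
	// Without the override the count would be 3; with it, node 3 is
	// excluded and the count is 2.
	fmt.Println(nonDecommissioningNodeCount(statuses, overrides)) // prints 2
}
```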
kvoli
left a comment
Reviewed 2 of 3 files at r2, 8 of 8 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)
pkg/server/decommission.go line 51 at r3 (raw file):
type decommissionRangeCheckResult struct {
	desc   *roachpb.RangeDescriptor
	action allocatorimpl.AllocatorAction
It would be nice if the allocator action didn't leak into the server pkg here. We discussed offline that this will only RPC the string.
pkg/server/decommission.go line 168 at r3 (raw file):
nodeIDs []roachpb.NodeID, strictReadiness bool, collectTraces bool,
See below comment on trace overhead at high decommission replica count when this is enabled
pkg/server/decommission.go line 252 at r3 (raw file):
}
action, _, recording, rErr := evalStore.AllocatorCheckRange(ctx, &desc, overrideStorePool)
What is the expected mem overhead of collecting a trace recording on every range containing a replica on a decommissioning node?
I see that the check can pass in a list of Nodes that are to be decommissioned so it's foreseeable that you could be collecting a few hundred k traces (especially with multi-store).
pkg/server/decommission_test.go line 170 at r3 (raw file):
hasAll := true
for _, idx := range serverIdxs {
	hasAll = hasAll && desc.Replicas().HasReplicaOnNode(tc.Server(idx).NodeID())
nit: Why not just return early here?
for (...) {
if !desc.Replicas().HasReplicaOnNode(tc.Server(idx).NodeID()) {
return false
}
}
return true
AlexTalks
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/server/decommission.go line 51 at r3 (raw file):
Previously, kvoli (Austen) wrote…
It would be nice if the allocator action didn't leak into the server pkg here. We discussed offline that this will only RPC the string.
From the RPC, yes, but the action is used within the evaluation to determine a few things (see evaluateRangeCheckResult). In this return struct I can use a string of course, though it means the AllocatorAction will still be used in the server pkg.
pkg/server/decommission.go line 252 at r3 (raw file):
Previously, kvoli (Austen) wrote…
What is the expected mem overhead of collecting a trace recording on every range containing a replica on a decommissioning node?
I see that the check can pass in a list of Nodes that are to be decommissioned so it's foreseeable that you could be collecting a few hundred k traces (especially with multi-store).
This is a good point and was one of the things I most wanted input on in this review. Especially since right now, it's collecting traces always, and just returning them if collectTraces is on.
I think that we should collect traces only if requested, and we should also make sure we have a limit on the number of errors so that we can avoid huge memory overheads.
pkg/server/decommission_test.go line 170 at r3 (raw file):
Previously, kvoli (Austen) wrote…
nit: Why not just return early here?
for (...) {
	if !desc.Replicas().HasReplicaOnNode(tc.Server(idx).NodeID()) {
		return false
	}
}
return true
Can do that.
Force-pushed from 1e57036 to 8debc93
AlexTalks
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/server/decommission.go line 51 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
From the RPC, yes, but the action is used within the evaluation to determine a few things (see evaluateRangeCheckResult). In this return struct I can use a string of course, though it means the AllocatorAction will still be used in the server pkg.
Done.
pkg/server/decommission.go line 168 at r3 (raw file):
Previously, kvoli (Austen) wrote…
See below comment on trace overhead at high decommission replica count when this is enabled
Done.
pkg/server/decommission.go line 252 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
This is a good point and was one of the things I most wanted input on in this review. Especially since right now, it's collecting traces always, and just returning them if collectTraces is on. I think that we should collect traces only if requested, and we should also make sure we have a limit on the number of errors so that we can avoid huge memory overheads.
Done.
Force-pushed from 8debc93 to bfa646d
This adds support for the evaluation of the decommission readiness of a node (or set of nodes), by simulating their liveness to have the DECOMMISSIONING status and utilizing the allocator to ensure that we are able to perform any actions needed to repair the range. This supports a "strict" mode, in which case we expect all ranges to only need replacement or removal due to the decommissioning status, or a more permissive "non-strict" mode, which allows for other actions needed, as long as they do not encounter errors in finding a suitable allocation target. The non-strict mode allows us to permit situations where a range may have more than one action needed to repair it, such as a range that needs to reach its replication factor before the decommissioning replica can be replaced, or a range that needs to finalize an atomic replication change. Depends on cockroachdb#94024. Part of cockroachdb#91568 Release note: None
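The strict vs. non-strict rule described above can be sketched as a small decision function. The names are hypothetical (not the actual allocator API): in strict mode only replacement or removal driven by decommissioning passes; in non-strict mode any pending action is tolerated as long as a suitable allocation target was found.

```go
package main

import (
	"errors"
	"fmt"
)

type AllocatorAction int

const (
	ActionNoop AllocatorAction = iota
	ActionReplaceDecommissioningVoter
	ActionRemoveDecommissioningVoter
	ActionAddVoter                        // e.g. still below replication factor
	ActionFinalizeAtomicReplicationChange // e.g. unfinished replication change
)

// rangeReady reports whether a range passes the readiness check for the
// given allocator action and allocation outcome.
func rangeReady(action AllocatorAction, allocErr error, strict bool) bool {
	switch action {
	case ActionNoop:
		return true // nothing blocks decommissioning this range
	case ActionReplaceDecommissioningVoter, ActionRemoveDecommissioningVoter:
		return allocErr == nil // the expected decommission-driven repairs
	default:
		// Other actions (upreplication, finalizing an atomic change) only
		// pass in non-strict mode, and only if a target could be found.
		return !strict && allocErr == nil
	}
}

func main() {
	fmt.Println(rangeReady(ActionAddVoter, nil, true))  // prints false
	fmt.Println(rangeReady(ActionAddVoter, nil, false)) // prints true
	fmt.Println(rangeReady(ActionReplaceDecommissioningVoter, errors.New("no target"), false)) // prints false
}
```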
Force-pushed from bfa646d to 09181f8
kvoli
left a comment
Reviewed 6 of 6 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks)
bors r=kvoli
Build failed (retrying...)
bors r=kvoli
Already running a review
Build succeeded
94983: allocator: correct node count calculations with overrides r=AlexTalks a=AlexTalks While we are correctly using the overrides set in the `OverrideStorePool` for the purposes of the node liveness function, the node count function did not properly incorporate the overrides previously. This change rectifies that, using the preset overrides specified at creation of the override store pool to calculate the number of non-decommissioning, non-decommissioned nodes (alive or dead), as viewed by the override store pool. This allows for correct calculation of the number of needed voters, allowing us to correctly determine which allocation action is needed for a range. Depends on #93758. Epic: [CRDB-20924](https://cockroachlabs.atlassian.net/browse/CRDB-20924) Release note: None Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
93950: server: implement decommission pre-check api r=AlexTalks a=AlexTalks This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in #93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on #93758, #90222. Epic: [CRDB-20924](https://cockroachlabs.atlassian.net/browse/CRDB-20924) Release note: None Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>