server: evaluate decommission pre-checks #93758
Conversation
Force-pushed from ff2f482 to c9f5e35
Force-pushed from c9f5e35 to 6ddfcb7
Force-pushed from 6ddfcb7 to 8c4d3bb
This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in cockroachdb#93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on cockroachdb#93758, cockroachdb#90222. Epic: CRDB-20924 Release note: None
Force-pushed from 8c4d3bb to 3dfe4dd
Force-pushed from 3dfe4dd to e98f254
Force-pushed from e98f254 to 74b47e7
Force-pushed from 74b47e7 to 9c05382
Force-pushed from 65641db to 1e57036
While we are correctly using the overrides set in the `OverrideStorePool` for the purposes of the node liveness function, the node count function did not properly incorporate the overrides previously. This change rectifies that, using the preset overrides specified at creation of the override store pool to calculate the number of non-decommissioning, non-decommissioned nodes (alive or dead), as viewed by the override store pool. This allows for correct calculation of the number of needed voters, allowing us to correctly determine which allocation action is needed for a range. Depends on cockroachdb#93758. Epic: CRDB-20924 Release note: None
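The fix described above can be sketched as follows: the count consults the preset override map first, then falls back to the stored liveness status, and excludes decommissioning/decommissioned nodes whether alive or dead. Names are illustrative, not the real `OverrideStorePool` API.

```go
package main

import "fmt"

type LivenessStatus int

const (
	StatusLive LivenessStatus = iota
	StatusDead
	StatusDecommissioning
	StatusDecommissioned
)

// nonDecommissioningNodeCount counts nodes that are neither
// decommissioning nor decommissioned, applying per-node overrides
// before falling back to the stored status.
func nonDecommissioningNodeCount(statuses, overrides map[int]LivenessStatus) int {
	n := 0
	for id, s := range statuses {
		if o, ok := overrides[id]; ok {
			s = o // preset override takes precedence
		}
		if s == StatusDecommissioning || s == StatusDecommissioned {
			continue
		}
		n++ // alive or dead both count toward needed voters
	}
	return n
}

func main() {
	statuses := map[int]LivenessStatus{1: StatusLive, 2: StatusDead, 3: StatusLive}
	overrides := map[int]LivenessStatus{3: StatusDecommissioning}
	// Without the override the count would be 3; with it, node 3 is
	// excluded and the count is 2.
	fmt.Println(nonDecommissioningNodeCount(statuses, overrides)) // prints 2
}
```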
kvoli
left a comment
Reviewed 2 of 3 files at r2, 8 of 8 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks)
pkg/server/decommission.go line 51 at r3 (raw file):
type decommissionRangeCheckResult struct {
	desc   *roachpb.RangeDescriptor
	action allocatorimpl.AllocatorAction
It would be nice if the allocator action didn't leak into the server pkg here. We discussed offline that this will only RPC the string.
pkg/server/decommission.go line 168 at r3 (raw file):
nodeIDs []roachpb.NodeID, strictReadiness bool, collectTraces bool,
See below comment on trace overhead at high decommission replica count when this is enabled
pkg/server/decommission.go line 252 at r3 (raw file):
}
action, _, recording, rErr := evalStore.AllocatorCheckRange(ctx, &desc, overrideStorePool)
What is the expected mem overhead of collecting a trace recording on every range containing a replica on a decommissioning node?
I see that the check can pass in a list of Nodes that are to be decommissioned so it's foreseeable that you could be collecting a few hundred k traces (especially with multi-store).
pkg/server/decommission_test.go line 170 at r3 (raw file):
hasAll := true
for _, idx := range serverIdxs {
	hasAll = hasAll && desc.Replicas().HasReplicaOnNode(tc.Server(idx).NodeID())
nit: Why not just return early here?
for (...) {
if !desc.Replicas().HasReplicaOnNode(tc.Server(idx).NodeID()) {
return false
}
}
return true
AlexTalks
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/server/decommission.go line 51 at r3 (raw file):
Previously, kvoli (Austen) wrote…
It would be nice if the allocator action didn't leak into the server pkg here. We discussed offline that this will only RPC the string.
From the RPC, yes, but the action is used within the evaluation to determine a few things (see evaluateRangeCheckResult). In this return struct I can use a string of course, though it means the AllocatorAction will still be used in the server pkg.
pkg/server/decommission.go line 252 at r3 (raw file):
Previously, kvoli (Austen) wrote…
What is the expected mem overhead of collecting a trace recording on every range containing a replica on a decommissioning node?
I see that the check can pass in a list of Nodes that are to be decommissioned so it's foreseeable that you could be collecting a few hundred k traces (especially with multi-store).
This is a good point and was one of the things I most wanted input on in this review. Especially since right now, it's collecting traces always, and just returning them if collectTraces is on.
I think that we should collect traces only if requested, and we should also make sure we have a limit on the number of errors so that we can avoid huge memory overheads.
pkg/server/decommission_test.go line 170 at r3 (raw file):
Previously, kvoli (Austen) wrote…
nit: Why not just return early here?
for (...) {
	if !desc.Replicas().HasReplicaOnNode(tc.Server(idx).NodeID()) {
		return false
	}
}
return true
Can do that.
Force-pushed from 1e57036 to 8debc93
AlexTalks
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/server/decommission.go line 51 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
From the RPC, yes, but the action is used within the evaluation to determine a few things (see evaluateRangeCheckResult). In this return struct I can use a string of course, though it means the AllocatorAction will still be used in the server pkg.
Done.
pkg/server/decommission.go line 168 at r3 (raw file):
Previously, kvoli (Austen) wrote…
See below comment on trace overhead at high decommission replica count when this is enabled
Done.
pkg/server/decommission.go line 252 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
This is a good point and was one of the things I most wanted input on in this review. Especially since right now, it's collecting traces always, and just returning them if collectTraces is on. I think that we should collect traces only if requested, and we should also make sure we have a limit on the number of errors so that we can avoid huge memory overheads.
Done.
Force-pushed from 8debc93 to bfa646d
This adds support for the evaluation of the decommission readiness of a node (or set of nodes), by simulating their liveness to have the DECOMMISSIONING status and utilizing the allocator to ensure that we are able to perform any actions needed to repair the range. This supports a "strict" mode, in which case we expect all ranges to only need replacement or removal due to the decommissioning status, or a more permissive "non-strict" mode, which allows for other actions needed, as long as they do not encounter errors in finding a suitable allocation target. The non-strict mode allows us to permit situations where a range may have more than one action needed to repair it, such as a range that needs to reach its replication factor before the decommissioning replica can be replaced, or a range that needs to finalize an atomic replication change. Depends on cockroachdb#94024. Part of cockroachdb#91568 Release note: None
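The strict vs. non-strict rule described above can be sketched as a small decision function. The names are hypothetical (not the actual allocator API): in strict mode only replacement or removal driven by decommissioning passes; in non-strict mode any pending action is tolerated as long as a suitable allocation target was found.

```go
package main

import (
	"errors"
	"fmt"
)

type AllocatorAction int

const (
	ActionNoop AllocatorAction = iota
	ActionReplaceDecommissioningVoter
	ActionRemoveDecommissioningVoter
	ActionAddVoter                        // e.g. still below replication factor
	ActionFinalizeAtomicReplicationChange // e.g. unfinished replication change
)

// rangeReady reports whether a range passes the readiness check for the
// given allocator action and allocation outcome.
func rangeReady(action AllocatorAction, allocErr error, strict bool) bool {
	switch action {
	case ActionNoop:
		return true // nothing blocks decommissioning this range
	case ActionReplaceDecommissioningVoter, ActionRemoveDecommissioningVoter:
		return allocErr == nil // the expected decommission-driven repairs
	default:
		// Other actions (upreplication, finalizing an atomic change) only
		// pass in non-strict mode, and only if a target could be found.
		return !strict && allocErr == nil
	}
}

func main() {
	fmt.Println(rangeReady(ActionAddVoter, nil, true))  // prints false
	fmt.Println(rangeReady(ActionAddVoter, nil, false)) // prints true
	fmt.Println(rangeReady(ActionReplaceDecommissioningVoter, errors.New("no target"), false)) // prints false
}
```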
Force-pushed from bfa646d to 09181f8
kvoli
left a comment
Reviewed 6 of 6 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks)
bors r=kvoli
Build failed (retrying...)
bors r=kvoli
Already running a review
Build succeeded
94983: allocator: correct node count calculations with overrides r=AlexTalks a=AlexTalks While we are correctly using the overrides set in the `OverrideStorePool` for the purposes of the node liveness function, the node count function did not properly incorporate the overrides previously. This change rectifies that, using the preset overrides specified at creation of the override store pool to calculate the number of non-decommissioning, non-decommissioned nodes (alive or dead), as viewed by the override store pool. This allows for correct calculation of the number of needed voters, allowing us to correctly determine which allocation action is needed for a range. Depends on #93758. Epic: [CRDB-20924](https://cockroachlabs.atlassian.net/browse/CRDB-20924) Release note: None Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
93950: server: implement decommission pre-check api r=AlexTalks a=AlexTalks This change implements the `DecommissionPreCheck` RPC on the `Admin` service, using the support for evaluating node decommission readiness by checking each range introduced in #93758. In checking node decommission readiness, only nodes that have a valid, non-`DECOMMISSIONED` liveness status are checked, and ranges with replicas on the checked nodes that encounter errors in attempting to allocate replacement replicas are reported in the response. Ranges that have replicas on multiple checked nodes have their errors reported for each nodeID in the request list. Depends on #93758, #90222. Epic: [CRDB-20924](https://cockroachlabs.atlassian.net/browse/CRDB-20924) Release note: None Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>