
server: don't drain when decommissioning #28707

Merged

craig[bot] merged 1 commit into cockroachdb:master from tbg:fix/decommission-quiesce on Aug 16, 2018

Conversation


@tbg tbg commented Aug 16, 2018

Prior to this commit, a server entering decommissioning would
automatically drain. This wasn't a great idea: pointing the
decommission command at a set of nodes that the cluster needs in
order to function would brick that cluster, and recovery was
difficult (since the draining state is picked up by the nodes
when they start).

Instead, decouple draining from decommissioning. Decommissioning simply
tells the allocator to move data off that node; when the node gets shut
down cleanly (the operator's responsibility), it will drain.

The resulting behavior when trying to decommission a too-large set of
nodes is now that the decommission command will simply not finish.
For example, starting a three node cluster and decommissioning all
nodes will simply hang (though the nodes will be marked as
decommissioning). Add three more nodes to the cluster and replicas
will move over to the newly added nodes, and the decommissioning
command will finish.

As a bonus, recommissioning a node now doesn't require the target
node to restart.

As a second bonus, the decommissioning acceptance test now takes
around 60% of the previous time (~45s, down from 70s+).

One remaining caveat is that users may forget that they attempted
to decommission nodes. We need to check that we prominently alert
in the UI when nodes are decommissioning. This isn't a new problem.

Fixes #27444.
Fixes #27025.

Release note (bug fix): decommissioning multiple nodes is now possible
without posing a risk to cluster health. Recommissioning a node no
longer requires a restart of the target node to take effect.

@tbg tbg requested a review from a team August 16, 2018 15:30
@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg force-pushed the fix/decommission-quiesce branch from 068f8f9 to 89a2852 Compare August 16, 2018 16:48
@tbg tbg requested a review from nvb August 16, 2018 16:49
@nvb nvb (Contributor) left a comment

:lgtm: but I think you mean "don't drain when decommissioning" in the commit/PR title.

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/acceptance/decommission_test.go, line 284 at r1 (raw file):

		exp := [][]string{
			decommissionHeader,
			// Expect the same usual, except this time the node should be draining

s/same usual/same as usual/

@tbg tbg changed the title from "server: don't quiesce when decommissioning" to "server: don't drain when decommissioning" Aug 16, 2018
@tbg tbg force-pushed the fix/decommission-quiesce branch from 89a2852 to f7c66c1 Compare August 16, 2018 19:48
@tbg tbg (Member, Author) commented Aug 16, 2018

Changed both, TFTR!

bors r=nvanbenschoten

craig bot pushed a commit that referenced this pull request Aug 16, 2018
28684: storage: make subsumption atomic with merge commit application r=bdarnell,tschottdorf a=benesch

~Please review the first commit separately in #28661.~

During a merge, the subsumed range logically ceases to exist at the
moment that the merge transaction commit applies. It's important that we
remove the subsumed range from disk in the same batch as the merge
commit to protect against an ill-timed crash. Tease apart
Replica.destroyRaftMuLocked to make this possible.

Release note: None

28707: server: don't drain when decommissioning r=nvanbenschoten a=tschottdorf


Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig craig bot (Contributor) commented Aug 16, 2018

Build succeeded

@craig craig bot merged commit f7c66c1 into cockroachdb:master Aug 16, 2018
@tbg tbg deleted the fix/decommission-quiesce branch August 20, 2018 13:44
@tbg tbg added the docs-todo label Aug 22, 2018
tbg added a commit to tbg/cockroach that referenced this pull request Sep 18, 2018
This is somehow the salient part of cockroachdb#28707 that I managed to miss
during the initial backport of that PR: we don't want to set the
node to draining when it is being decommissioned.

Note that the /health?ready=1 endpoint will continue to return a
503 error once decommissioning starts due to a readiness check
(something I am preparing a PR against master for).

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Sep 25, 2018


Development

Successfully merging this pull request may close these issues.

storage: Cluster can't recover if all nodes are decommissioned and restarted
server: decommissioning with replication factor one causes outage

3 participants