
global: Allow updates of failed services with restart policy "none" #2288

Merged
aaronlehmann merged 1 commit into moby:master from aaronlehmann:global-service-restart-none on Jul 20, 2017

Conversation

@aaronlehmann
Collaborator

There is some tension around what to do when reconciling a service whose restart policy is "none". On one hand, the restart policy says not to restart a failed task. On the other hand, the declarative state of the service says a certain number of replicas of the service should run, or the task should run on all possible nodes, and a change to the spec might resolve the condition that caused the failure.

Currently, the replicated orchestrator chooses to favor the declarative state when reconciling a service. This means that on a leader change or change to the service, we effectively ignore the restart policy.

The global orchestrator, on the other hand, refuses to touch the task on any node where it has failed when the restart policy says not to restart it. This can lead to very unintuitive behavior: for example, setting up a service that fails because its command line is wrong, then updating the service to fix the command line, and seeing nothing happen.

I believe the replicated orchestrator's behavior is a better choice, and less confusing to users. It's very bad for a service to get into a state where it can't be updated because it has already failed on all nodes and therefore won't be touched on those nodes.

This changes the global orchestrator to behave like the replicated orchestrator, ignoring the restart policy on startup and when the service is updated.

Note that this means that on leader changes, failed global services will be restarted, which might be contrary to expectations. But it also means that a "service update" will be able to fix a broken global service regardless of the restart policy, which is very important.

cc @aluzzardi
cc @dongluochen (in case you're still around)

@aaronlehmann aaronlehmann force-pushed the global-service-restart-none branch from f0b7992 to 3e610a6 Compare June 28, 2017 00:01
@codecov

codecov bot commented Jun 28, 2017

Codecov Report

Merging #2288 into master will increase coverage by 0.02%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2288      +/-   ##
==========================================
+ Coverage   60.21%   60.24%   +0.02%     
==========================================
  Files         128      128              
  Lines       25994    25839     -155     
==========================================
- Hits        15653    15566      -87     
+ Misses       8955     8892      -63     
+ Partials     1386     1381       -5

Member

@thaJeztah thaJeztah left a comment


LGTM - do we need a test for this?

I agree with this change. One thing that comes to mind: I know users use --restart-condition=none to run "one-off" tasks. If they do not remove the task/service afterwards, this change means it can be triggered again.

This may be a corner-case, so I think it's sufficient to just document this change in behaviour (at least make sure it's in the changelog)

Perhaps we should think of an --on-exit=remove option, that removes a service after the tasks have run and exited.

@aluzzardi @stevvooe any thoughts?

@stevvooe
Contributor

LGTM

This might interfere with some uses of restart-none as a batch mode primitive, but I am not sure we should support that hack in lieu of a real batch solution.

@thaJeztah
Member

@stevvooe agreed; I was referring to the same in my comment

@cyli
Contributor

cyli commented Jul 19, 2017

Would it be possible to constrain restarting services on leader change by checking the task spec version (if it exists)? So if the task has failed, but the service spec version hasn't changed, then even on a leader change that particular task doesn't get restarted?

Other than that, the behavior and code seem good to me, with the caveat that I'm not super familiar with this code. Perhaps it would be good to add a test case though? There don't seem to be any tests for RestartOnNone policies.

Update: although apologies, would this be affected by changes as a result of #932? That one seems to just concern the replicated orchestrator, but I just wanted to make sure the behavior between the two orchestrators would be consistent?

@aaronlehmann
Collaborator Author

#932 affects both orchestrators. This PR makes the global orchestrator consistent with the replicated orchestrator, and whatever is done to fix #932 will have to change both in a consistent way. #2290 is a first attempt at this, which affects both orchestrators through the changes it makes to the restart supervisor. It also depends on the changes here for correctness.

Would it be possible to constrain restarting services on leader change by checking the task spec version (if it exists)? So if the task has failed, but the service spec version hasn't changed, then even on a leader change that particular task doesn't get restarted?

Something like this is possible, but it depends on the failed task still existing. We've tried to avoid depending on that, because sometimes users use --task-history-limit 0 and it would be unexpected for that to influence orchestration behavior.

The solution in #2290 seems promising to me, but if it doesn't work out for some reason, this might be something to consider as an alternative.

@cyli
Contributor

cyli commented Jul 19, 2017

@aaronlehmann Ah ok. Thanks for explaining.

@aaronlehmann aaronlehmann force-pushed the global-service-restart-none branch from a10088a to 9e1daf0 Compare July 19, 2017 22:05
@aaronlehmann
Collaborator Author

I've added some test coverage, fixed the tests not to leak the orchestrator, and changed the global orchestrator to ignore task deletions (which would wrongly trigger a reconciliation).

testutils.Expect(t, watch, state.EventCommit{})
testutils.Expect(t, watch, api.EventUpdateTask{}) // ready->running
testutils.Expect(t, watch, state.EventCommit{})

Contributor

Would it make sense to also delete the task here, and assert that it doesn't get re-created, since that behavior has also changed?

Collaborator Author

Added back TestRemoveTask, now expecting different behavior.

@cyli
Contributor

cyli commented Jul 19, 2017

One minor test nit, otherwise LGTM!

@aaronlehmann aaronlehmann force-pushed the global-service-restart-none branch from 9e1daf0 to 8969945 Compare July 20, 2017 00:05
@aaronlehmann aaronlehmann force-pushed the global-service-restart-none branch from 8969945 to 3fc9d0a Compare July 20, 2017 00:14
@aluzzardi
Member

LGTM

@aaronlehmann aaronlehmann merged commit b42631a into moby:master Jul 20, 2017
@aaronlehmann aaronlehmann deleted the global-service-restart-none branch July 20, 2017 00:38
silvin-lubecki pushed a commit to silvin-lubecki/docker-ce that referenced this pull request Feb 3, 2020
- moby/swarmkit#2288 (Allow updates of failed services with restart policy "none")
- moby/swarmkit#2304 (Reset restart history when task spec changes)
- moby/swarmkit#2309 (updating the service spec version when rolling back)
- moby/swarmkit#2310 (fix for slow swarm shutdown)

Signed-off-by: Ying <ying.li@docker.com>
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Feb 3, 2020
glours pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 11, 2020
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 17, 2020
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 23, 2020