global: Allow updates of failed services with restart policy "none" #2288
Conversation
Force-pushed from f0b7992 to 3e610a6
Codecov Report
@@ Coverage Diff @@
## master #2288 +/- ##
==========================================
+ Coverage 60.21% 60.24% +0.02%
==========================================
Files 128 128
Lines 25994 25839 -155
==========================================
- Hits 15653 15566 -87
+ Misses 8955 8892 -63
+ Partials 1386 1381 -5
Force-pushed from 3e610a6 to a10088a
thaJeztah left a comment
LGTM - do we need a test for this?
I agree with this change. One thing comes to mind: I know users use --restart-condition=none to run "one-off" tasks. If they do not remove the task/service afterwards, this change means the task can be triggered again.
This may be a corner case, so I think it's sufficient to just document this change in behaviour (at least make sure it's in the changelog).
Perhaps we should think of an --on-exit=remove option that removes a service after the tasks have run and exited.
@aluzzardi @stevvooe any thoughts?
LGTM. This might interfere with some uses of restart-none as a batch-mode primitive, but I am not sure we should support that hack in lieu of a real batch solution.
@stevvooe agreed; I was referring to the same in my comment.
Would it be possible to constrain restarting services on leader change by checking the task spec version (if it exists)? So if the task has failed, but the service spec version hasn't changed, then even on a leader change that particular task doesn't get restarted. Other than that, the behavior and code seem good to me, with the caveat that I'm not super familiar with this code. Perhaps it would be good to add a test case, though? There don't seem to be any tests for Update. Also, apologies, but would this be affected by the changes resulting from #932? That one seems to just concern the replicated orchestrator, but I wanted to make sure the behavior between the two orchestrators would be consistent.
#932 affects both orchestrators. This PR makes the global orchestrator consistent with the replicated orchestrator, and whatever is done to fix #932 will have to change both in a consistent way. #2290 is a first attempt at this, which affects both orchestrators through the changes it makes to the restart supervisor. It also depends on the changes here for correctness.
Something like this is possible, but it depends on the failed task still existing. We've tried to avoid depending on that, because sometimes users use
The solution in #2290 seems promising to me, but if it doesn't work out for some reason, this might be something to consider as an alternative.
@aaronlehmann Ah ok. Thanks for explaining.
Force-pushed from a10088a to 9e1daf0
I've added some test coverage, fixed the tests so they don't leak the orchestrator, and changed the global orchestrator to ignore task deletions (which would wrongly trigger a reconciliation).
testutils.Expect(t, watch, state.EventCommit{})
testutils.Expect(t, watch, api.EventUpdateTask{}) // ready->running
testutils.Expect(t, watch, state.EventCommit{})
Would it make sense to also delete the task here, and assert that it doesn't get re-created, since that behavior has also changed?
Added back TestRemoveTask, now expecting different behavior.
One minor test nit, otherwise LGTM!
Force-pushed from 9e1daf0 to 8969945
There is some tension around what to do when reconciling a service whose restart policy is "none". On one hand, the restart policy says not to restart a failed task. On the other hand, the declarative state of the service says a certain number of replicas of the service should run, or the task should run on all possible nodes, and a change to the spec might resolve the condition that caused the failure.

Currently, the replicated orchestrator chooses to favor the declarative state when reconciling a service. This means that on a leader change or a change to the service, we effectively ignore the restart policy. The global orchestrator, on the other hand, refuses to touch the task on any node where it has failed, if the restart policy says not to restart it. This can lead to very unintuitive behavior: for example, setting up a service that fails because its command line is wrong, then updating the service to fix the command line, and seeing nothing happen.

I believe the replicated orchestrator's behavior is the better choice, and less confusing to users. It's very bad for a service to get into a state where it can't be updated because it has already failed on all nodes and therefore won't be touched on those nodes.

This changes the global orchestrator to behave like the replicated orchestrator, ignoring the restart policy on startup and when the service is updated.

Note that this means that on leader changes, failed global services will be restarted, which might be contrary to expectations. But it also means that a "service update" will be able to fix a broken global service regardless of the restart policy, which is very important.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed from 8969945 to 3fc9d0a
LGTM
- moby/swarmkit#2288 (Allow updates of failed services with restart policy "none")
- moby/swarmkit#2304 (Reset restart history when task spec changes)
- moby/swarmkit#2309 (updating the service spec version when rolling back)
- moby/swarmkit#2310 (fix for slow swarm shutdown)

Signed-off-by: Ying <ying.li@docker.com>
cc @aluzzardi
cc @dongluochen (in case you're still around)