global: Allow updates of failed services with restart policy "none" #2288
Conversation
Force-pushed from f0b7992 to 3e610a6
Codecov Report
@@ Coverage Diff @@
## master #2288 +/- ##
==========================================
+ Coverage 60.21% 60.24% +0.02%
==========================================
Files 128 128
Lines 25994 25839 -155
==========================================
- Hits 15653 15566 -87
+ Misses 8955 8892 -63
+ Partials 1386 1381 -5
Force-pushed from 3e610a6 to a10088a
thaJeztah left a comment
LGTM - do we need a test for this?
I agree with this change. One thing comes to mind: I know users use --restart-condition=none to run "one-off" tasks. If they do not remove the task/service afterwards, this change means the task can be triggered again.
This may be a corner case, so I think it's sufficient to just document this change in behaviour (at least make sure it's in the changelog).
Perhaps we should think of an --on-exit=remove option that removes a service after the tasks have run and exited.
@aluzzardi @stevvooe any thoughts?
LGTM. This might interfere with some uses of restart-none as a batch-mode primitive, but I am not sure we should support that hack in lieu of a real batch solution.
@stevvooe agreed; I was referring to the same in my comment.
Would it be possible to constrain restarting services on leader change by checking the task spec version (if it exists)? So if the task has failed, but the service spec version hasn't changed, then even on a leader change that particular task doesn't get restarted. Other than that, the behavior and code seem good to me, with the caveat that I'm not super familiar with this code. Perhaps it would be good to add a test case, though? There don't seem to be any tests for Update. Also, apologies, but would this be affected by the changes resulting from #932? That one seems to just concern the replicated orchestrator, but I wanted to make sure the behavior between the two orchestrators would be consistent.
#932 affects both orchestrators. This PR makes the global orchestrator consistent with the replicated orchestrator, and whatever is done to fix #932 will have to change both in a consistent way. #2290 is a first attempt at this, which affects both orchestrators through the changes it makes to the restart supervisor. It also depends on the changes here for correctness.
Something like this is possible, but it depends on the failed task still existing. We've tried to avoid depending on that, because sometimes users use
The solution in #2290 seems promising to me, but if it doesn't work out for some reason, this might be something to consider as an alternative.
@aaronlehmann Ah ok. Thanks for explaining.
Force-pushed from a10088a to 9e1daf0
I've added some test coverage, fixed the tests so they don't leak the orchestrator, and changed the global orchestrator to ignore task deletions (which would wrongly trigger a reconciliation).
testutils.Expect(t, watch, state.EventCommit{})
testutils.Expect(t, watch, api.EventUpdateTask{}) // ready->running
testutils.Expect(t, watch, state.EventCommit{})
Would it make sense to also delete the task here, and assert that it doesn't get re-created, since that behavior has also changed?
Added back TestRemoveTask, now expecting different behavior.
One minor test nit, otherwise LGTM!
Force-pushed from 9e1daf0 to 8969945
There is some tension around what to do when reconciling a service whose restart policy is "none". On one hand, the restart policy says not to restart a failed task. On the other hand, the declarative state of the service says a certain number of replicas of the service should run, or the task should run on all possible nodes, and a change to the spec might resolve the condition that caused the failure.

Currently, the replicated orchestrator chooses to favor the declarative state when reconciling a service. This means that on a leader change or a change to the service, we effectively ignore the restart policy. The global orchestrator, on the other hand, refuses to touch the task on any node where it has failed, if the restart policy says not to restart it. This can lead to very unintuitive behavior: for example, setting up a service that fails because its command line is wrong, then updating the service to fix the command line, and seeing nothing happen.

I believe the replicated orchestrator's behavior is the better choice, and less confusing to users. It's very bad for a service to get into a state where it can't be updated because it has already failed on all nodes and therefore won't be touched on those nodes.

This changes the global orchestrator to behave like the replicated orchestrator, ignoring the restart policy on startup and when the service is updated.

Note that this means that on leader changes, failed global services will be restarted, which might be contrary to expectations. But it also means that a "service update" will be able to fix a broken global service regardless of the restart policy, which is very important.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed from 8969945 to 3fc9d0a
LGTM
- moby/swarmkit#2288 (Allow updates of failed services with restart policy "none")
- moby/swarmkit#2304 (Reset restart history when task spec changes)
- moby/swarmkit#2309 (updating the service spec version when rolling back)
- moby/swarmkit#2310 (fix for slow swarm shutdown)

Signed-off-by: Ying <ying.li@docker.com>
cc @aluzzardi
cc @dongluochen (in case you're still around)