-
Notifications
You must be signed in to change notification settings - Fork 18.9k
Automatic service rollback on failure #31108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
367c595 to
96b1ef6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add "|rollback"?
📝
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks
4d498a6 to
9b1bbf9
Compare
9b1bbf9 to
9b99e39
Compare
|
Rebased |
api/swagger.yaml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=> Amount of time between rollback iterations,
daemon/cluster/convert/service.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the value from GRPC is unrecognized, there is nothing we can really do here. We could return an error, but I think that would be too extreme. This is not changing the old behavior, just reorganizing the code and adding a new case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this PR is mixed with rollback change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was causing problems testing the change. I can break it out to a separate PR if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine if this will be merged soon. This PR might be cherry-picked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should not block the shutdown because a failure in the deactivate service binding, unless user is given a command to fix it. I was just looking at it, while I came to know from @dongluochen this was taken care in this PR as a side change.
Given this PR may take some time to go through, I am also thinking it would be good to have a separate PR which can be quickly merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled this part out into #31279
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(pause|continue|rollback) => (pause|continue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, this was a rebase issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no rollback on rollback failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, this was a rebase issue.
9b99e39 to
d874a3d
Compare
|
@dongluochen: Addressed your comments, thanks. |
|
LGTM |
|
LGTM |
d874a3d to
014be19
Compare
|
Rebased |
vdemeester
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🐯
|
Docs change LGTM. |
|
bash and zsh completion could be done in the follow up PRs. |
|
I think it might be good to add a simple example section in the docs, and Also, the PR needs a rebase. |
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Server-side rollback can take advantage of the rollback-specific update parameters, instead of being treated as a normal update that happens to go back to a previous version of the spec. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
014be19 to
f9bd8ec
Compare
|
Thanks @yongtang. I've updated the API version history and added some content to |
|
LGTM. Thanks! |
|
/cc @albers @sdurrheimer for bash/zsh completion. |
This PR adds support for a "rollback" failure action. When this is selected, the orchestrator will automatically roll back to the previous version of a service if an update hits the failure threshold.
It adds the following options to control the behavior of rollbacks. They are kept separate from the existing
--update-*options, since there are situations where the behavior should be different for rollbacks. For example, an administrator may wish to roll out a new version of a service very slowly, but if failure thresholds are exceeded, switch back to the old version quickly. Arguably, we may be able to live with hardcoded defaults for some of these options, but it seemed better to stay consistent with the--update-*options than to introduce arbitrary differences.--rollback-delay--rollback-failure-action--rollback-max-failure-ratio--rollback-monitor--rollback-parallelismI'm open to reducing the set of options if desired, but at a minimum we should keep
--rollback-delayand--rollback-parallelism.This PR also switches the manually invoked
service update --rollbackto work on the server side, instead of being a purely client side operation. This allows manually-initiated rollbacks to respect the new rollback parameters, and it also has the benefits of avoiding some nasty hacks we previously used. The client is version-aware, so it will still use the old method against an older daemon.Note that
--rollbackhas been changed so that it can't be used in combination with otherservice updateflags. This is a compromise to avoid adding further hacks. While it's unfortunate to be restricting this functionality, it does make the implementation much cleaner.cc @aluzzardi @dongluochen