
orchestrator/restart: Reset restart history when task spec changes#2304

Merged
aaronlehmann merged 3 commits into moby:master from aaronlehmann:reset-restart-attempts
Jul 20, 2017

Conversation

@aaronlehmann
Collaborator

When a service reaches the limit on the number of restart attempts and is then updated, the limit is currently still enforced. This is the wrong behavior: restarts that happened before the service was updated shouldn't count against the new version, since there may have been a problem with the old service spec that the new spec fixes.

See moby/moby#34007
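The intended behavior can be sketched in Go roughly as follows. This is an illustrative sketch, not SwarmKit's actual restart supervisor: the `supervisor`, `restartInfo`, and `specVersion` names and fields here are assumptions for demonstration only.

```go
package main

import "fmt"

// specVersion stands in for the service spec's version; any update bumps it.
type specVersion struct{ Index uint64 }

// restartInfo records, per instance, which spec version the restart count
// belongs to.
type restartInfo struct {
	specVersion   specVersion
	totalRestarts uint64
}

// supervisor is a hypothetical stand-in for the restart supervisor.
type supervisor struct {
	history     map[string]*restartInfo // keyed per instance, not per version
	maxAttempts uint64
}

// shouldRestart enforces the attempt limit, but resets the history whenever
// the task's spec version changes, so restarts of an older version of the
// service don't count against the new one.
func (s *supervisor) shouldRestart(instance string, v specVersion) bool {
	info := s.history[instance]
	if info == nil {
		info = &restartInfo{specVersion: v}
		s.history[instance] = info
	} else if info.specVersion != v {
		// Service was updated: clear this instance's restart history.
		*info = restartInfo{specVersion: v}
	}
	if info.totalRestarts >= s.maxAttempts {
		return false
	}
	info.totalRestarts++
	return true
}

func main() {
	s := &supervisor{history: map[string]*restartInfo{}, maxAttempts: 2}
	v1, v2 := specVersion{1}, specVersion{2}
	fmt.Println(s.shouldRestart("svc1.slot1", v1)) // true
	fmt.Println(s.shouldRestart("svc1.slot1", v1)) // true
	fmt.Println(s.shouldRestart("svc1.slot1", v1)) // false: limit reached
	fmt.Println(s.shouldRestart("svc1.slot1", v2)) // true: new spec resets history
}
```

Keeping the map key version-agnostic and clearing the value on a version change (rather than keying by version) avoids accumulating stale entries for old spec versions.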

@codecov

codecov bot commented Jul 12, 2017

Codecov Report

Merging #2304 into master will decrease coverage by 0.06%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2304      +/-   ##
==========================================
- Coverage   60.29%   60.23%   -0.07%     
==========================================
  Files         128      128              
  Lines       25974    25972       -2     
==========================================
- Hits        15661    15644      -17     
- Misses       8921     8947      +26     
+ Partials     1392     1381      -11

Member

@thaJeztah thaJeztah left a comment


thanks! Just reading through the code, left some comments but they're more ramblings on my side 😄

// be easy to clean up map entries corresponding to old specVersions.
// Making the key version-agnostic and clearing the value whenever the
// version changes avoids the issue of stale map entries for old
// versions.
Member


👍 ❤️

}
r.historyByService[restartTask.ServiceID][tuple] = struct{}{}

restartInfo.totalRestarts++
Member


So do we track max attempted restarts for a task, not for a service? Does this mean that if I have a service with two replicas, and one of those replicas keeps failing, that replica is no longer restarted, but the other replica is kept running? (i.e., the service running in degraded mode)

Member


Guess I didn't realise that (it's confusing which options apply to the service as a whole, and which ones apply to individual tasks).

Collaborator Author


It's per "instance", so either per-replica or per-node depending whether it's a replicated or global service.

I believe the reasoning for this was that if you have one node that's broken or misbehaving and has tasks restarting in a loop, you don't want that to impact the ability to restart other tasks when they encounter occasional problems. By tracking restarts per-instance, the instances don't share a global maximum number of restarts. It seems more useful overall.

But I agree it can be confusing.

Member


Perhaps a future addition; on failure; re-schedule task 🤔

Collaborator Author


Perhaps a future addition; on failure; re-schedule task

We actually do have a mechanism that reschedules tasks after they have failed on the same node repeatedly: #1552

Of course, it only applies to replicated services.

Member


TIL. @mstanleyjones, not sure if we documented that?



I believe we have. It sounds familiar.


restartInfo := r.history[instanceTuple]
if restartInfo == nil {
restartInfo := r.historyByService[t.ServiceID][instanceTuple]
Member


Perhaps I'm mis-interpreting the TODO at the top;

// TODO(aluzzardi): This function should not depend on `service`.

But doesn't this make it even more dependent on service? (I agree it's cleaner 😄)

Collaborator Author


I believe the TODO is saying that service should not be passed in as a function parameter. I don't see how this function could work without being aware of the service ID.

Member


Was thinking if the TODO meant: instance id's are globally unique, thus we only need to track per instance

Collaborator Author


No, we don't have a concept of an instance ID. We track instances as a combination of service ID and either a slot number or node ID.
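The tuple described here can be sketched as a comparable Go struct used as a map key. This is an illustrative sketch only; the `instanceTuple` field names are assumptions, not SwarmKit's exact types.

```go
package main

import "fmt"

// instanceTuple identifies an "instance": service ID plus either a slot
// number (replicated services) or a node ID (global services).
// All fields are comparable, so the struct can key a map directly.
type instanceTuple struct {
	serviceID string
	slot      uint64 // set for replicated services; zero otherwise
	nodeID    string // set for global services; empty otherwise
}

func main() {
	// Restart counts tracked per instance, not per service.
	history := map[instanceTuple]int{}

	// Replicated service: one entry per slot.
	history[instanceTuple{serviceID: "web", slot: 1}]++
	history[instanceTuple{serviceID: "web", slot: 1}]++
	history[instanceTuple{serviceID: "web", slot: 2}]++

	// Global service: one entry per node.
	history[instanceTuple{serviceID: "agent", nodeID: "node-a"}]++

	fmt.Println(history[instanceTuple{serviceID: "web", slot: 1}]) // 2
	fmt.Println(history[instanceTuple{serviceID: "web", slot: 2}]) // 1
}
```

Because each slot (or node) gets its own counter, one replica restarting in a loop cannot exhaust the restart budget of the service's other instances.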

}

delete(r.historyByService, serviceID)

Member


nit: Wondering if this should also re-initialize the history

Collaborator Author


This is called when the service is deleted, so we want the map entry for it to be gone.

Member


ah makes sense

Member

@thaJeztah thaJeztah left a comment


LGTM

ping @aluzzardi @cyli PTAL

@moby moby deleted a comment from GordonTheTurtle Jul 13, 2017
Contributor

@cyli cyli left a comment


LGTM

@cyli
Contributor

cyli commented Jul 13, 2017

(Should this go into 17.06.1?)

@thaJeztah
Member

I don't think this is a regression since 17.03, so including it in 17.06.2 would be OK with me. It's a good one to have, though, as it improves the (perceived) stability of swarm mode.

(Open to input, though, on whether this should be prioritised and can make it in time.)

Aaron Lehmann added 3 commits July 20, 2017 13:44
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
It is not correct to count restarts of older versions of the service
against the restart limit.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann aaronlehmann force-pushed the reset-restart-attempts branch from d8ba7bb to d2e8152 on July 20, 2017 at 20:46
@aaronlehmann aaronlehmann merged commit ed13aa9 into moby:master Jul 20, 2017
@aaronlehmann aaronlehmann deleted the reset-restart-attempts branch July 20, 2017 21:04
silvin-lubecki pushed a commit to silvin-lubecki/docker-ce that referenced this pull request Feb 3, 2020
- moby/swarmkit#2288 (Allow updates of failed services with restart policy "none")
- moby/swarmkit#2304 (Reset restart history when task spec changes)
- moby/swarmkit#2309 (updating the service spec version when rolling back)
- moby/swarmkit#2310 (fix for slow swarm shutdown)

Signed-off-by: Ying <ying.li@docker.com>
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Feb 3, 2020
glours pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 11, 2020
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 17, 2020
silvin-lubecki pushed a commit to silvin-lubecki/engine-extract that referenced this pull request Mar 23, 2020