orchestrator/restart: Reset restart history when task spec changes #2304

aaronlehmann merged 3 commits into moby:master
Conversation
Force-pushed from 4d35dc2 to d8ba7bb
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #2304      +/-   ##
==========================================
- Coverage   60.29%   60.23%   -0.07%
==========================================
  Files         128      128
  Lines       25974    25972       -2
==========================================
- Hits        15661    15644      -17
- Misses       8921     8947      +26
+ Partials     1392     1381      -11
```
thaJeztah
left a comment
thanks! Just reading through the code, left some comments but they're more ramblings on my side 😄
```go
// be easy to clean up map entries corresponding to old specVersions.
// Making the key version-agnostic and clearing the value whenever the
// version changes avoids the issue of stale map entries for old
// versions.
}
r.historyByService[restartTask.ServiceID][tuple] = struct{}{}
```
```go
restartInfo.totalRestarts++
```
So do we track max attempted restarts for a task, not for a service? Does this mean that if I have a service with two replicas, and one of those replicas keeps failing, that replica is no longer restarted, but the other replica is kept running? (i.e., the service running in degraded mode)
Guess I didn't realise that (it's confusing which options apply to the service as a whole, and which ones apply to individual tasks).
It's per "instance", so either per-replica or per-node depending whether it's a replicated or global service.
I believe the reasoning for this was that if you have one node that's broken or misbehaving and has tasks restarting in a loop, you don't want that to impact the ability to restart other tasks when they encounter occasional problems. By tracking restarts per-instance, the instances don't share a global maximum number of restarts. It seems more useful overall.
But I agree it can be confusing.
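A minimal sketch of the per-instance accounting described above, with illustrative names (not the actual swarmkit types): each replica slot keeps its own counter against the service's restart limit, so one crash-looping replica exhausting its budget does not block restarts of the other replicas. Treating `maxAttempts == 0` as "no limit" is an assumption of this sketch.

```go
package main

import "fmt"

// slotKey identifies one replica of a service (a node ID would play
// the slot's role for a global service). Names are illustrative.
type slotKey struct {
	serviceID string
	slot      uint64
}

// shouldRestart checks the per-instance counter against the service's
// restart limit. Each instance has its own budget; they do not share a
// global maximum.
func shouldRestart(counts map[slotKey]uint64, k slotKey, maxAttempts uint64) bool {
	if maxAttempts == 0 {
		return true // 0 is treated as "no limit" in this sketch
	}
	return counts[k] < maxAttempts
}

func main() {
	counts := map[slotKey]uint64{
		{serviceID: "web", slot: 1}: 5, // replica 1 has been crash-looping
		{serviceID: "web", slot: 2}: 0, // replica 2 has never failed
	}
	// Replica 1 exhausted its budget, but replica 2 is unaffected.
	fmt.Println(shouldRestart(counts, slotKey{"web", 1}, 3)) // false
	fmt.Println(shouldRestart(counts, slotKey{"web", 2}, 3)) // true
}
```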
Perhaps a future addition: on failure, re-schedule the task 🤔

> Perhaps a future addition: on failure, re-schedule the task
We actually do have a mechanism that reschedules tasks after they have failed on the same node repeatedly: #1552
Of course, it only applies to replicated services.
TIL @mstanleyjones not sure if we documented that?
I believe we have. It sounds familiar.
```go
restartInfo := r.history[instanceTuple]
if restartInfo == nil {
	restartInfo := r.historyByService[t.ServiceID][instanceTuple]
```
Perhaps I'm mis-interpreting the TODO at the top:

> // TODO(aluzzardi): This function should not depend on `service`.
But doesn't this make it even more dependent on service? (I agree it's cleaner 😄)
I believe the TODO is saying that service should not be passed in as a function parameter. I don't see how this function could work without being aware of the service ID.
I was thinking the TODO meant: instance IDs are globally unique, thus we only need to track per instance.

No, we don't have a concept of an instance ID. We track instances as a combination of service ID and either a slot number or node ID.
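The keying scheme described above could be sketched as follows; the type and function names are hypothetical, not the actual swarmkit code. There is no standalone instance ID: the key combines the service ID with a slot number for replicated services, or a node ID for global services.

```go
package main

import "fmt"

// instanceTuple identifies a single "instance" of a service.
// Illustrative names only.
type instanceTuple struct {
	serviceID string
	slot      uint64 // set for replicated services
	nodeID    string // set for global services
}

// tupleForTask builds the key: replicated services are distinguished by
// slot number, global services (one task per node) by node ID.
func tupleForTask(serviceID string, slot uint64, nodeID string) instanceTuple {
	if slot != 0 {
		// Replicated service: the slot number distinguishes replicas.
		return instanceTuple{serviceID: serviceID, slot: slot}
	}
	// Global service: the node ID distinguishes tasks.
	return instanceTuple{serviceID: serviceID, nodeID: nodeID}
}

func main() {
	fmt.Println(tupleForTask("svc1", 3, "nodeA")) // keyed by slot
	fmt.Println(tupleForTask("svc2", 0, "nodeB")) // keyed by node ID
}
```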
| } | ||
|
|
||
| delete(r.historyByService, serviceID) | ||
|
|
There was a problem hiding this comment.
nit: Wondering if this should also re-initialize the history
This is called when the service is deleted, so we want the map entry for it to be gone.
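As a sketch of this design (illustrative names, not the actual swarmkit types): because the history is nested under the service ID, cleaning up a deleted service is a single map delete, and there is nothing to re-initialize, since a fresh inner map is created lazily the next time that service records a restart.

```go
package main

import "fmt"

// tuple identifies one instance of a service. Illustrative names.
type tuple struct {
	serviceID string
	slot      uint64
}

type supervisor struct {
	// historyByService nests per-instance entries under the service ID,
	// so all of a service's history can be dropped in one operation.
	historyByService map[string]map[tuple]struct{}
}

// record creates the inner map lazily, then marks the instance.
func (s *supervisor) record(t tuple) {
	if s.historyByService[t.serviceID] == nil {
		s.historyByService[t.serviceID] = map[tuple]struct{}{}
	}
	s.historyByService[t.serviceID][t] = struct{}{}
}

// serviceDeleted drops all history for a removed service; no
// re-initialization is needed.
func (s *supervisor) serviceDeleted(serviceID string) {
	delete(s.historyByService, serviceID)
}

func main() {
	s := &supervisor{historyByService: map[string]map[tuple]struct{}{}}
	s.record(tuple{"svc1", 1})
	s.serviceDeleted("svc1")
	fmt.Println(len(s.historyByService)) // 0
}
```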
thaJeztah
left a comment
LGTM
ping @aluzzardi @cyli PTAL
(Should this go into 17.06.1?)

I don't think this is a regression since 17.03, so including it in 17.06.2 would be ok with me. It's a good one to have though, as it improves the (perceived) stability of swarm mode. (Open for input though if this should be prioritised and is able to make it in time.)
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

It is not correct to count restarts of older versions of the service against the restart limit.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed from d8ba7bb to d2e8152
- moby/swarmkit#2288 (Allow updates of failed services with restart policy "none")
- moby/swarmkit#2304 (Reset restart history when task spec changes)
- moby/swarmkit#2309 (updating the service spec version when rolling back)
- moby/swarmkit#2310 (fix for slow swarm shutdown)

Signed-off-by: Ying <ying.li@docker.com>
When a service reaches the limit on the number of restart attempts and is then updated, the limit is currently still enforced. This isn't the right behavior, because restarts that happened before the update shouldn't count against the new version: there may have been a problem with the old service spec that the new spec fixes.
See moby/moby#34007
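The fix can be sketched roughly as follows, using illustrative types rather than the actual swarmkit code: the per-instance restart record remembers the spec version it was accumulated under, and is cleared when a restart is recorded under a different version, so restarts of the old spec never count against the new one.

```go
package main

import "fmt"

// instanceTuple is a version-agnostic key for one instance of a service.
// Names are illustrative, not the actual swarmkit types.
type instanceTuple struct {
	serviceID string
	slot      uint64
}

type restartInfo struct {
	specVersion   uint64 // spec version this history was accumulated under
	totalRestarts uint64
}

type supervisor struct {
	history map[instanceTuple]*restartInfo
}

// recordRestart bumps the per-instance counter, resetting the history
// first if the task's spec version has changed since the last restart.
// This keeps the map key version-agnostic while avoiding stale entries
// for old versions.
func (s *supervisor) recordRestart(t instanceTuple, version uint64) uint64 {
	info := s.history[t]
	if info == nil || info.specVersion != version {
		info = &restartInfo{specVersion: version}
		s.history[t] = info
	}
	info.totalRestarts++
	return info.totalRestarts
}

func main() {
	s := &supervisor{history: map[instanceTuple]*restartInfo{}}
	k := instanceTuple{serviceID: "svc1", slot: 1}
	s.recordRestart(k, 1)
	s.recordRestart(k, 1)
	n := s.recordRestart(k, 1)
	fmt.Println("restarts under v1:", n) // 3
	n = s.recordRestart(k, 2)            // spec updated: history resets
	fmt.Println("restarts under v2:", n) // 1
}
```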