orchestrator/restart: Reset restart history when task spec changes #2304

aaronlehmann merged 3 commits into moby:master
Conversation
Force-pushed from 4d35dc2 to d8ba7bb
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #2304      +/-   ##
==========================================
- Coverage   60.29%   60.23%   -0.07%
==========================================
  Files         128      128
  Lines       25974    25972       -2
==========================================
- Hits        15661    15644      -17
- Misses       8921     8947      +26
+ Partials     1392     1381      -11
```
thaJeztah
left a comment
thanks! Just reading through the code, left some comments but they're more ramblings on my side 😄
```go
// be easy to clean up map entries corresponding to old specVersions.
// Making the key version-agnostic and clearing the value whenever the
// version changes avoids the issue of stale map entries for old
// versions.
}
r.historyByService[restartTask.ServiceID][tuple] = struct{}{}
```
```go
restartInfo.totalRestarts++
```
So do we track max attempted restarts for a task, not for a service? Does this mean that if I have a service with two replicas, and one of those replicas keeps failing, that replica is no longer restarted, but the other replica is kept running? (i.e., the service running in degraded mode)
Guess I didn't realise that (it's confusing which options apply to the service as a whole, and which ones apply to individual tasks).
It's per "instance", so either per-replica or per-node depending whether it's a replicated or global service.
I believe the reasoning for this was that if you have one node that's broken or misbehaving and has tasks restarting in a loop, you don't want that to impact the ability to restart other tasks when they encounter occasional problems. By tracking restarts per-instance, the instances don't share a global maximum number of restarts. It seems more useful overall.
But I agree it can be confusing.
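A minimal sketch of the per-instance accounting described above, with illustrative names (not the actual swarmkit types): each replica slot keeps its own counter against the service's restart limit, so one crash-looping replica exhausting its budget does not block restarts of the other replicas. Treating `maxAttempts == 0` as "no limit" is an assumption of this sketch.

```go
package main

import "fmt"

// slotKey identifies one replica of a service (a node ID would play
// the slot's role for a global service). Names are illustrative.
type slotKey struct {
	serviceID string
	slot      uint64
}

// shouldRestart checks the per-instance counter against the service's
// restart limit. Each instance has its own budget; they do not share a
// global maximum.
func shouldRestart(counts map[slotKey]uint64, k slotKey, maxAttempts uint64) bool {
	if maxAttempts == 0 {
		return true // 0 is treated as "no limit" in this sketch
	}
	return counts[k] < maxAttempts
}

func main() {
	counts := map[slotKey]uint64{
		{serviceID: "web", slot: 1}: 5, // replica 1 has been crash-looping
		{serviceID: "web", slot: 2}: 0, // replica 2 has never failed
	}
	// Replica 1 exhausted its budget, but replica 2 is unaffected.
	fmt.Println(shouldRestart(counts, slotKey{"web", 1}, 3)) // false
	fmt.Println(shouldRestart(counts, slotKey{"web", 2}, 3)) // true
}
```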
Perhaps a future addition: on failure, re-schedule the task 🤔

> Perhaps a future addition: on failure, re-schedule the task
We actually do have a mechanism that reschedules tasks after they have failed on the same node repeatedly: #1552
Of course, it only applies to replicated services.
TIL @mstanleyjones not sure if we documented that?
I believe we have. It sounds familiar.
```go
restartInfo := r.history[instanceTuple]
if restartInfo == nil {
	restartInfo := r.historyByService[t.ServiceID][instanceTuple]
```
Perhaps I'm mis-interpreting the TODO at the top:

> // TODO(aluzzardi): This function should not depend on `service`.
But doesn't this make it even more dependent on service? (I agree it's cleaner 😄)
I believe the TODO is saying that service should not be passed in as a function parameter. I don't see how this function could work without being aware of the service ID.
I was thinking the TODO meant: instance IDs are globally unique, thus we only need to track per instance.

No, we don't have a concept of an instance ID. We track instances as a combination of service ID and either a slot number or node ID.
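The keying scheme described above could be sketched as follows; the type and function names are hypothetical, not the actual swarmkit code. There is no standalone instance ID: the key combines the service ID with a slot number for replicated services, or a node ID for global services.

```go
package main

import "fmt"

// instanceTuple identifies a single "instance" of a service.
// Illustrative names only.
type instanceTuple struct {
	serviceID string
	slot      uint64 // set for replicated services
	nodeID    string // set for global services
}

// tupleForTask builds the key: replicated services are distinguished by
// slot number, global services (one task per node) by node ID.
func tupleForTask(serviceID string, slot uint64, nodeID string) instanceTuple {
	if slot != 0 {
		// Replicated service: the slot number distinguishes replicas.
		return instanceTuple{serviceID: serviceID, slot: slot}
	}
	// Global service: the node ID distinguishes tasks.
	return instanceTuple{serviceID: serviceID, nodeID: nodeID}
}

func main() {
	fmt.Println(tupleForTask("svc1", 3, "nodeA")) // keyed by slot
	fmt.Println(tupleForTask("svc2", 0, "nodeB")) // keyed by node ID
}
```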
| } | ||
|
|
||
| delete(r.historyByService, serviceID) | ||
|
|
There was a problem hiding this comment.
nit: Wondering if this should also re-initialize the history
This is called when the service is deleted, so we want the map entry for it to be gone.
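As a sketch of this design (illustrative names, not the actual swarmkit types): because the history is nested under the service ID, cleaning up a deleted service is a single map delete, and there is nothing to re-initialize, since a fresh inner map is created lazily the next time that service records a restart.

```go
package main

import "fmt"

// tuple identifies one instance of a service. Illustrative names.
type tuple struct {
	serviceID string
	slot      uint64
}

type supervisor struct {
	// historyByService nests per-instance entries under the service ID,
	// so all of a service's history can be dropped in one operation.
	historyByService map[string]map[tuple]struct{}
}

// record creates the inner map lazily, then marks the instance.
func (s *supervisor) record(t tuple) {
	if s.historyByService[t.serviceID] == nil {
		s.historyByService[t.serviceID] = map[tuple]struct{}{}
	}
	s.historyByService[t.serviceID][t] = struct{}{}
}

// serviceDeleted drops all history for a removed service; no
// re-initialization is needed.
func (s *supervisor) serviceDeleted(serviceID string) {
	delete(s.historyByService, serviceID)
}

func main() {
	s := &supervisor{historyByService: map[string]map[tuple]struct{}{}}
	s.record(tuple{"svc1", 1})
	s.serviceDeleted("svc1")
	fmt.Println(len(s.historyByService)) // 0
}
```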
thaJeztah
left a comment
LGTM
ping @aluzzardi @cyli PTAL
(Should this go into 17.06.1?)

I don't think this is a regression since 17.03, so including it in 17.06.2 would be ok with me. It's a good one to have though, as it improves the (perceived) stability of swarm mode. (Open for input though if this should be prioritised and is able to make it in time.)
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

It is not correct to count restarts of older versions of the service against the restart limit.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed from d8ba7bb to d2e8152
- moby/swarmkit#2288 (Allow updates of failed services with restart policy "none")
- moby/swarmkit#2304 (Reset restart history when task spec changes)
- moby/swarmkit#2309 (updating the service spec version when rolling back)
- moby/swarmkit#2310 (fix for slow swarm shutdown)

Signed-off-by: Ying <ying.li@docker.com>
When a service reaches the limit on the number of restart attempts and is then updated, the limit is currently still enforced. This isn't the right behavior, because restarts that happened before the update shouldn't count against the new version: there may have been a problem with the old service spec that the new spec fixes.
See moby/moby#34007
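The fix can be sketched roughly as follows, using illustrative types rather than the actual swarmkit code: the per-instance restart record remembers the spec version it was accumulated under, and is cleared when a restart is recorded under a different version, so restarts of the old spec never count against the new one.

```go
package main

import "fmt"

// instanceTuple is a version-agnostic key for one instance of a service.
// Names are illustrative, not the actual swarmkit types.
type instanceTuple struct {
	serviceID string
	slot      uint64
}

type restartInfo struct {
	specVersion   uint64 // spec version this history was accumulated under
	totalRestarts uint64
}

type supervisor struct {
	history map[instanceTuple]*restartInfo
}

// recordRestart bumps the per-instance counter, resetting the history
// first if the task's spec version has changed since the last restart.
// This keeps the map key version-agnostic while avoiding stale entries
// for old versions.
func (s *supervisor) recordRestart(t instanceTuple, version uint64) uint64 {
	info := s.history[t]
	if info == nil || info.specVersion != version {
		info = &restartInfo{specVersion: version}
		s.history[t] = info
	}
	info.totalRestarts++
	return info.totalRestarts
}

func main() {
	s := &supervisor{history: map[instanceTuple]*restartInfo{}}
	k := instanceTuple{serviceID: "svc1", slot: 1}
	s.recordRestart(k, 1)
	s.recordRestart(k, 1)
	n := s.recordRestart(k, 1)
	fmt.Println("restarts under v1:", n) // 3
	n = s.recordRestart(k, 2)            // spec updated: history resets
	fmt.Println("restarts under v2:", n) // 1
}
```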